Abstract. We derive new characterizations of the matrix $\Phi$-entropies introduced in [Electron. J. Probab.,
19(20): 1-30, 2014]. These characterizations help to better understand the properties of matrix
$\Phi$-entropies, and are a powerful tool for establishing matrix concentration inequalities for matrix-valued
functions of independent random variables. In particular, we use the subadditivity property to prove a
Poincaré inequality for the matrix $\Phi$-entropies. We also provide a new proof for the matrix Efron-Stein
inequality. Furthermore, we derive logarithmic Sobolev inequalities for matrix-valued functions defined on
Boolean hypercubes and with Gaussian distributions. Our proof relies on the powerful matrix
Bonami-Beckner inequality. Finally, the Holevo quantity in quantum information theory is closely related to the
matrix $\Phi$-entropies. This allows us to upper bound the Holevo quantity of a classical-quantum ensemble
that undergoes a special Markov evolution.


1. Introduction 

The introduction of classical $\Phi$-entropies can be traced back to the early days of information theory,
where the notions of $\phi$-divergence [1] and convex analysis [2-4] were defined. Formally, classical
$\Phi$-entropies refer to the set of functions $\Phi : [0, \infty) \to \mathbb{R}$ that are continuous, convex on
$[0, \infty)$, twice differentiable on $(0, \infty)$, and either affine or such that $\Phi''$ is strictly positive
and $1/\Phi''$ is concave. This set includes rich members, e.g. $x^p$ for $p \in [1, 2]$, and $x \log x$. For every
nonnegative integrable random variable $Z$, the $\Phi$-entropy functional $H_\Phi$ is defined as

$$H_\Phi(Z) = \mathbb{E}\,\Phi(Z) - \Phi(\mathbb{E} Z).$$

It can already be seen that the variance and the entropy function of $Z$ correspond to $H_{x^2}(Z)$ and
$H_{x \log x}(Z)$, respectively.
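The definition can be checked numerically on a small discrete random variable. The following is a minimal sketch, assuming a hypothetical three-point distribution; the helper names `phi_entropy` and `variance` are illustrative and not part of the paper.

```python
import math

def phi_entropy(values, probs, phi):
    """Classical Phi-entropy H_Phi(Z) = E[Phi(Z)] - Phi(E[Z]) of a discrete Z."""
    mean = sum(p * z for z, p in zip(values, probs))
    return sum(p * phi(z) for z, p in zip(values, probs)) - phi(mean)

def variance(values, probs):
    """Variance of a discrete random variable, computed from its definition."""
    mean = sum(p * z for z, p in zip(values, probs))
    return sum(p * (z - mean) ** 2 for z, p in zip(values, probs))

# Hypothetical nonnegative random variable Z (support and probabilities are assumptions).
values = [1.0, 2.0, 4.0]
probs = [0.5, 0.25, 0.25]

h_square = phi_entropy(values, probs, lambda x: x * x)           # Phi(x) = x^2
h_xlogx = phi_entropy(values, probs, lambda x: x * math.log(x))  # Phi(x) = x log x

print(h_square, variance(values, probs))  # the two numbers agree
print(h_xlogx)                            # nonnegative, by Jensen's inequality
```

With $\Phi(x) = x^2$ the functional reduces to the variance, and with $\Phi(x) = x \log x$ it is nonnegative because $\Phi$ is convex.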

The investigation of the general properties of the $\Phi$-entropies has enjoyed great success because it
unifies the study of concentration inequalities [5]. Of these, the subadditivity property of the $\Phi$-entropies
has led to the derivation of $\Phi$-Sobolev inequalities, generalizing the logarithmic Sobolev (i.e. log-Sobolev)
and Poincaré inequalities, which in turn are a crucial step toward the powerful Bousquet's inequality [6,
7]. Let $Z = f(X_1, \ldots, X_n)$, where $X_1, \ldots, X_n$ are independent random variables, and $f$ is a nonnegative
function. We say $H_\Phi(Z)$ is subadditive if

$$H_\Phi(Z) \le \sum_{i=1}^{n} \mathbb{E}\, H_\Phi^{(i)}(Z),$$

where $H_\Phi^{(i)}(Z) = \mathbb{E}_i \Phi(Z) - \Phi(\mathbb{E}_i Z)$ is the conditional entropy, and $\mathbb{E}_i$ denotes
conditional expectation conditioned on the $n - 1$ random variables
$X_{-i} = (X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n)$. The subadditivity property
and the definition of $\Phi$-entropies are intimately connected to each other. In fact, one can show that they
are equivalent [8, 9]. Furthermore, further equivalences between the subadditivity of the classical
$\Phi$-entropies and convexity of other forms of the function $\Phi$ have also been established, which prove to be
useful in other contexts as well [8, 9].
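The subadditivity inequality can be illustrated on a toy example. The sketch below assumes $\Phi(x) = x^2$ (so the inequality reduces to the Efron-Stein inequality for the variance), two hypothetical independent discrete variables, and a hypothetical function `f`; here $\mathbb{E}_i$ averages over $X_i$ alone with the other variable held fixed.

```python
from itertools import product

# Hypothetical independent discrete variables, given as (value, probability) pairs.
X1 = [(0.0, 0.3), (1.0, 0.7)]
X2 = [(1.0, 0.5), (3.0, 0.5)]

def f(x1, x2):
    """Hypothetical nonnegative function of (X1, X2)."""
    return (x1 + 1.0) * x2

def phi(x):
    """Phi(x) = x^2, so H_Phi is the variance."""
    return x * x

def entropy(pairs):
    """H_Phi of a discrete variable given as (value, probability) pairs."""
    mean = sum(p * z for z, p in pairs)
    return sum(p * phi(z) for z, p in pairs) - phi(mean)

# Left-hand side: H_Phi(Z) for Z = f(X1, X2).
joint = [(f(a, b), pa * pb) for (a, pa), (b, pb) in product(X1, X2)]
lhs = entropy(joint)

# Right-hand side: sum_i E[H_Phi^{(i)}(Z)], where E_i integrates out X_i only.
term1 = sum(pb * entropy([(f(a, b), pa) for a, pa in X1]) for b, pb in X2)
term2 = sum(pa * entropy([(f(a, b), pb) for b, pb in X2]) for a, pa in X1)
rhs = term1 + term2

print(lhs, rhs)  # lhs does not exceed rhs
```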

In many branches of science and engineering, observed data are more efficiently represented as matrices,
and system performance can be inferred from an analysis in which randomness assumptions are placed on the
matrices. The subject of random matrices was studied extensively in the 20th century, with each discipline
largely conducting its own independent investigations. A recent review of modern random matrix theory by Tropp
surveyed the most successful methods from these areas, and provided interesting examples that these
techniques can illuminate [10]. It is clear that the future development of random matrix theory will
benefit from a unified and systematic treatment.

Parallel to the classical entropy method, Chen and Tropp defined the class of trace matrix $\Phi$-entropies
[11]. The class of matrix entropy functions can be equivalently characterized in terms of the second
derivative of their representing function, and this equivalent statement readily implies that the set of
matrix entropies is a convex cone [12]. Consider a random matrix $Z$ taking values in $\mathbb{M}_d^{+}$, with
$\mathbb{E}\|Z\|_\infty < \infty$ and $\mathbb{E}\|\Phi(Z)\|_\infty < \infty$. The matrix $\Phi$-entropy functional $H_\Phi$ is defined as

$$H_\Phi(Z) = \mathrm{tr}\left[\mathbb{E}\,\Phi(Z) - \Phi(\mathbb{E} Z)\right],$$

where tr is the normalized trace [11]. The matrix $\Phi$-entropies are subadditive. Unlike their classical
counterparts, very few connections between the matrix $\Phi$-entropies and other convex forms of the same functions
in the class have been established [12, 13].
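As an illustration of the matrix definition, the sketch below evaluates $H_\Phi(Z)$ for a hypothetical two-valued random $2 \times 2$ matrix, assuming $\Phi(x) = x^2$ so that $\Phi(Z) = Z^2$ needs no eigendecomposition. The matrices, probabilities and helper names are assumptions made for this example only.

```python
def matmul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def mix(mats, probs):
    """Expectation of a finitely supported random 2x2 matrix."""
    return [[sum(p * m[i][j] for m, p in zip(mats, probs)) for j in range(2)] for i in range(2)]

def ntr(m):
    """Normalized trace tr[M] = Tr[M] / d with d = 2."""
    return (m[0][0] + m[1][1]) / 2.0

# Hypothetical positive semi-definite values of Z and their probabilities.
Z_values = [[[2.0, 1.0], [1.0, 2.0]], [[1.0, 0.0], [0.0, 3.0]]]
probs = [0.5, 0.5]

mean = mix(Z_values, probs)
mean_of_squares = mix([matmul(z, z) for z in Z_values], probs)

# H_Phi(Z) = tr[E Phi(Z) - Phi(E Z)] with Phi(x) = x^2.
h_phi = ntr(mean_of_squares) - ntr(matmul(mean, mean))
print(h_phi)
```

For this choice of $\Phi$ the result coincides with the normalized trace of the matrix variance $\mathbb{E}(Z - \mathbb{E}Z)^2$.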

In this paper, we prove new characterizations of the class of matrix $\Phi$-entropies defined in [11].
Furthermore, our results show that matrix $\Phi$-entropies satisfy all known equivalent statements that classical
$\Phi$-entropy functions satisfy [5, 8, 9]. Our results provide additional justification for the original definition
of the matrix $\Phi$-entropies, which is a matrix generalization of the classical $\Phi$-entropies. The equivalences
between matrix $\Phi$-entropies and other convex forms of the function $\Phi$ will help to understand the class
of entropy functions and to unify the study of matrix concentration inequalities in the future.

The study of log-Sobolev inequalities, Poincaré inequalities and hypercontractivity was originally motivated
by the question of how fast a Markov process or a Markov chain mixes. Interested readers can
find a detailed discussion in, e.g., Ref. [14]. The pioneering work by Gross in 1975 derived the log-Sobolev
inequalities for the balanced Bernoulli and Gaussian distributions, and determined the optimal constant
in the former case [15]. Bonami and Beckner independently discovered a version of the hypercontractivity
inequality. Since then, it has become a rich area of research that influences the investigation of, among
others, a non-asymptotic theory of concentration inequalities.

Log-Sobolev inequalities and hypercontractivity have also emerged as powerful tools in quantum information
theory [16-23], and have found many applications [20]. In quantum field theory, hypercontractivity is used
to prove tighter mixing time bounds in the analysis of certain physical systems whose evolutions can be
modelled as continuous-time Markov processes [16, 17, 21]. In theoretical computer science,
hypercontractivity is used to bound the local influence of a binary bit on the total matrix-valued function [18, 19]. A
remarkable step was made by King [22], who generalized classical hypercontractivity for the Boolean cube
to non-commutative quantum hypercontractivity for the depolarizing channel.

We extend those studies of non-commutative Poincaré inequalities and log-Sobolev inequalities to
the random matrix framework. In other words, we establish matrix concentration inequalities, such
as Poincaré and log-Sobolev inequalities, for the matrix $\Phi$-entropies. This framework is particularly
useful for quantum information processing tasks that involve quantum systems carrying classical labels, since
a classical-quantum ensemble is a special type of random matrix. As a result, the matrix concentration
inequalities developed in this work can help to better illustrate the improvement brought by
quantum communication or computation in the finite regime. As an example, we study how the Holevo
quantity of a classical-quantum ensemble changes if the ensemble evolves according to a classical Markov
kernel on its classical labels and a post-selection rule.

1.1. Our Results. The contribution of this paper is threefold. 

(1) First, we derive equivalent characterizations of the matrix $\Phi$-entropies in Table 1. Notably, all
known equivalent characterizations for the classical $\Phi$-entropies can be generalized to their matrix
counterparts.
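Rows (a)-(j); in each row the first entry is for the Classical $\Phi$-Entropy Class (C1) and the second
for the Matrix $\Phi$-Entropy Class (C2).

(a) (C1): $\Phi$ is affine or $\Phi'' > 0$ and $1/\Phi''$ is concave
    (C2): $\Phi$ is affine or $D\Phi'$ is invertible and $(D\Phi')^{-1}$ is concave

(b) (C1): convexity of $(u, v) \mapsto \Phi(u + v) - \Phi(u) - \Phi'(u)v$
    (C2): convexity of $(u, v) \mapsto \mathrm{Tr}\,[\Phi(u + v) - \Phi(u) - D\Phi[u](v)]$

(c) (C1): convexity of $(u, v) \mapsto (\Phi'(u + v) - \Phi'(u))v$
    (C2): convexity of $(u, v) \mapsto \mathrm{Tr}\,[D\Phi[u + v](v) - D\Phi[u](v)]$

(d) (C1): convexity of $(u, v) \mapsto \Phi''(u)v^2$
    (C2): convexity of $(u, v) \mapsto \mathrm{Tr}\,[D^2\Phi[u](v, v)]$

(e) (C1): $\Phi$ is affine or $\Phi'' > 0$ and $\Phi''''\Phi'' \ge 2\Phi'''^2$
    (C2): Equation (3.2)

(f) (C1): convexity of $(u, v) \mapsto t\Phi(u) + (1 - t)\Phi(v) - \Phi(tu + (1 - t)v)$ for any $0 \le t \le 1$
    (C2): convexity of $(u, v) \mapsto \mathrm{Tr}\,[t\Phi(u) + (1 - t)\Phi(v) - \Phi(tu + (1 - t)v)]$ for any $0 \le t \le 1$

(g) (C1): $\mathbb{E}_1 H_\Phi(Z|X_1) \ge H_\Phi(\mathbb{E}_1 Z)$
    (C2): $\mathbb{E}_1 H_\Phi(Z|X_1) \ge H_\Phi(\mathbb{E}_1 Z)$

(h) (C1): $H_\Phi(Z)$ is a convex function of $Z$
    (C2): $H_\Phi(Z)$ is a convex function of $Z$

(i) (C1): $H_\Phi(Z) = \sup_{T > 0} \{\mathbb{E}\,[\Upsilon_1(T) \cdot Z + \Upsilon_2(T)]\}$
    (C2): $H_\Phi(Z) = \sup_{T \succ 0} \{\mathrm{tr}\,\mathbb{E}\,[\Upsilon_1(T) \cdot Z + \Upsilon_2(T)]\}$

(j) (C1): $H_\Phi(Z) \le \sum_{i=1}^{n} \mathbb{E}\, H_\Phi^{(i)}(Z)$
    (C2): $H_\Phi(Z) \le \sum_{i=1}^{n} \mathbb{E}\, H_\Phi^{(i)}(Z)$

Table 1. Comparison between the equivalent characterizations of the $\Phi$-entropy functional
class (C1) (Definition 3.1) and the matrix $\Phi$-entropy functional class (C2) (Definition 3.2).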


We emphasize that additional characterizations of the $\Phi$-entropies prove to be useful in many
instances. The characterizations (b)-(d) in (C1) are explored by Chafaï [9] to derive several
entropic inequalities for M/M/$\infty$ queueing processes that are not diffusions. With the
characterizations (b)-(d), the difficulty of lacking the diffusion property can be circumvented and replaced
by convexity. The characterization (j) in (C1), also known as the subadditivity property, plays
a crucial role in deriving powerful entropic inequalities for functions of a series of independent
(not necessarily identical) random variables, including Poincaré inequalities, Sobolev inequalities,
logarithmic Sobolev inequalities, and Bousquet's inequalities [6, 7], while characterizations (g) and
(i) are key steps to obtain (j) in (C1).

On the other hand, the characterization (b) in (C2), the joint convexity of the matrix Bregman
divergence, is used to derive a sharp inequality for the quantum Tsallis entropy of a tripartite
state. This is considered a generalization of the strong subadditivity of the von Neumann
entropy [13]. In this work, the characterization (j) in (C2) is shown to be crucial in deriving
various entropic inequalities for matrix $\Phi$-entropies, including matrix Poincaré inequalities and
matrix Sobolev inequalities. Likewise, characterizations (g) and (i) are key steps to obtain (j) in
(C2) [11].

(2) After establishing equivalent characterizations of matrix $\Phi$-entropies, we move on to derive matrix
concentration inequalities, including matrix Poincaré inequalities, matrix Sobolev inequalities, and
matrix logarithmic Sobolev inequalities. A fundamental tool used in the proofs is the subadditivity
property of the matrix $\Phi$-entropies.

• Poincaré inequality: We prove a Poincaré inequality for matrix-valued functions in Theorem
4.2 (see also Corollary 4.1), generalizing the classical Poincaré inequality [24, 25]:

$$\mathrm{Var}(f(X)) \le \mathbb{E}\,\|\nabla f(X)\|^2, \tag{1.1}$$

where $X = (X_1, \ldots, X_n)$ denotes a vector of independent random variables, each $X_i$ taking values
in the interval $[0, 1]$. Our proof, paralleling its classical counterpart, relies on the
matrix-valued Efron-Stein inequality (Theorem 4.1). Both Theorem 4.2 and Corollary 4.1 recover
the classical Poincaré inequality (1.1) when the matrix dimension $d = 1$.

We also derive various Poincaré inequalities for matrix-valued functions under additional
assumptions such as pairwise commutation (Corollary 4.1) or Lipschitz functions (Corollary
4.2). Finally, we derive a matrix Gaussian Poincaré inequality for Gaussian Unitary
Ensembles (Theorem 4.3).

• $\Phi$-Sobolev inequality: We prove a restricted $\Phi$-Sobolev inequality for matrix-valued functions
defined on the Boolean hypercube in Theorem 4.5, which we then extend to a
$\Phi$-Sobolev inequality for Gaussian distributions (Theorem 4.6). Our $\Phi$-Sobolev inequality is
defective (see Remark 4.5 for a discussion of tight and defective $\Phi$-Sobolev inequalities),
but again it recovers the classical $\Phi$-Sobolev inequality when $d = 1$. Our proof builds upon a
powerful matrix Bonami-Beckner inequality [18], from which the hypercontractivity inequality
for matrix-valued functions on Boolean hypercubes can be obtained.

The matrix logarithmic Sobolev inequalities in Corollaries 4.3 and 4.4 follow immediately
from Theorems 4.5 and 4.6.

(3) Finally, we connect matrix $\Phi$-entropies to quantum information theory. When $\Phi(x) = x \log x$ and
the random matrix $\rho_X = \{p(x), \rho_x\}_{x \in \mathcal{X}}$, where each $\rho_x \succeq 0$ and $\mathrm{Tr}\,\rho_x = 1$, is a
classical-quantum ensemble, $H_\Phi(\rho_X)$ is equal to the Holevo quantity $\chi(\{p, \rho\})$ (up to a constant
dimensional factor, for technical purposes only). If the ensemble $\rho_Y = \{q(y), \sigma_y\}_{y \in \mathcal{Y}}$ is obtained
by evolving $\rho_X$ with a Markov kernel $K(y|x)$:

$$q(y) = \sum_{x} p(x) K(y|x),$$

$$\sigma_y = \sum_{x} \rho_x K^*(x|y),$$

where $K^*(x|y)$ is the backward channel of $K$, then the Holevo quantity $\chi(\{p, \rho\})$ is bounded
from above by a constant $c$ times the average Holevo quantity of the ensembles that come from
post-selecting the original $\{p, \rho\}$ by the post-selection rule $K^*$. Moreover, the constant $c$ is related
to the ratio of the Holevo quantities $\chi(\{p, \rho\})$ and $\chi(\{q, \sigma\})$ (see Proposition 5.1). This can be
viewed as a stronger form of the classical strong data processing inequality [26, 27].
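The Markov evolution above can be traced on a two-label example. The sketch below is an illustration under stated assumptions: the prior `p`, the kernel `K` and the diagonal (commuting) states `rho` are hypothetical, and the backward channel is taken in its usual Bayes form $K^*(x|y) = p(x)K(y|x)/q(y)$, which the text above does not spell out.

```python
# Hypothetical prior p(x) on two labels and kernel K[x][y] = K(y|x).
p = [0.25, 0.75]
K = [[0.9, 0.1],
     [0.2, 0.8]]

# Output distribution q(y) = sum_x p(x) K(y|x).
q = [sum(p[x] * K[x][y] for x in range(2)) for y in range(2)]

# Assumed Bayes form of the backward channel: K*(x|y) = p(x) K(y|x) / q(y).
K_star = [[p[x] * K[x][y] / q[y] for y in range(2)] for x in range(2)]

# Hypothetical diagonal density matrices rho_x, stored by their diagonals.
rho = [[1.0, 0.0],
       [0.5, 0.5]]

# Evolved states sigma_y = sum_x rho_x K*(x|y), again stored by their diagonals.
sigma = [[sum(K_star[x][y] * rho[x][i] for x in range(2)) for i in range(2)]
         for y in range(2)]

print(q)
print(sigma)
```

Each column of `K_star` sums to one, so every `sigma[y]` is again a unit-trace state.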

1.2. Prior work. 

(1) For the history of the equivalent characterizations in the class (C1), we refer to an excellent
textbook [5] and the papers [8, 9].

The original definition of the matrix $\Phi$-entropy class, namely (a) in (C2), was proposed by Chen
and Tropp in 2014 [11]. In the same paper, they also established the subadditivity property (j)
through (i) and (g): (a) $\Rightarrow$ (i) $\Rightarrow$ (g) $\Rightarrow$ (j). Shortly after, the equivalence between (a) and the
joint convexity of the matrix Bregman divergence (b) was proved in [13]. The equivalence
between (a) and (d) is almost immediately implied by the result in [12] (see the detailed discussion
in the proof of Theorem 3.3).

(2) Very few matrix concentration results have been established for general matrix-valued functions of
independent random variables. To the best of our knowledge, the only gem in this area is a family
of polynomial Efron-Stein inequalities for random matrices [28], where the theory of exchangeable
pairs is used in the proof. In this work, we use the subadditivity property of the matrix $\Phi$-entropies
to derive an Efron-Stein inequality in Theorem 4.1, which is a special case of the result in [28] (for
square functions). Note that in this special case, the constant in our theorem is better than that
in [28].

We would also like to point out that the matrix $\Phi$-entropies defined in this paper are different
from the entropy functions in the non-commutative $L_p$ spaces in [16-23]. Currently, we do not
know how to relate these two definitions. Hence, our matrix concentration inequalities in Section 4
are incomparable with those in the non-commutative $L_p$ spaces.

We organize the paper in the following way. Section 2 reviews the matrix algebra necessary for the
remainder of the paper. We show new characterizations of matrix $\Phi$-entropies in Section 3. We then
derive matrix concentration inequalities for matrix $\Phi$-entropies in Section 4. We connect matrix
$\Phi$-entropies to quantum information theory in Section 5, and derive an upper bound on the Holevo quantity.
Appendix A provides useful lemmas. We review the classical Bonami-Beckner inequality in Appendix B.

2. Preliminaries 

In this section we present the background information necessary for this paper. Basic notation is
introduced in Section 2.1. We then review operator algebra with a focus on Fréchet derivatives and
convexity properties of matrix-valued functions in Sections 2.2 and 2.3, respectively.
2.1. Notation. Given a separable Hilbert space $\mathcal{H}$, denote by $\mathbb{M}$ the Banach space of all linear operators
on $\mathcal{H}$. The set $\mathbb{M}^{sa}$ refers to the subspace of self-adjoint operators in $\mathbb{M}$. We denote by $\mathbb{M}^{+}$ (resp. $\mathbb{M}^{++}$)
the set of positive semi-definite (resp. positive-definite) operators in $\mathbb{M}^{sa}$. If the dimension $d$ of a Hilbert
space $\mathcal{H}$ needs special attention, then we highlight it in subscripts, e.g. $\mathbb{M}_d$ denotes the Banach space
of $d \times d$ complex matrices. For each interval $I \subset \mathbb{R}$, we define the set of self-adjoint operators whose
eigenvalues lie in the interval to be

$$\mathbb{M}^{sa}(I) = \{M \in \mathbb{M}^{sa} : [\lambda_{\min}(M), \lambda_{\max}(M)] \subset I\},$$

where $\lambda_{\min}(M)$ and $\lambda_{\max}(M)$ are the minimal and maximal eigenvalues of $M$, respectively.

The trace function $\mathrm{Tr} : \mathbb{M} \to \mathbb{C}$ is defined as

$$\mathrm{Tr}\,[M] = \sum_{k} e_k^* M e_k \quad \text{for } M \in \mathbb{M},$$

where $(e_k)_k$ is any orthonormal basis of $\mathcal{H}$. If we focus on finite-dimensional Hilbert spaces, then the
trace of $M$ is equal to the sum of its eigenvalues. In this paper, we introduce the
normalized trace function tr for every matrix $M \in \mathbb{M}_d$ as

$$\mathrm{tr}\,[M] = \frac{1}{d}\,\mathrm{Tr}\,[M].$$

The normalized trace function enjoys a convexity property (see Lemma 2.1), which will be convenient for
later derivations in this paper.

For $p \in [1, \infty)$, the Schatten $p$-norm of an operator $M \in \mathbb{M}$ is denoted as

$$\|M\|_p \triangleq \Big( \sum_{i} |\lambda_i(M)|^p \Big)^{1/p}, \tag{2.1}$$

where $\{\lambda_i(M)\}$ are the singular values of $M$. We also define the supremum norm of a (finite or infinite)
matrix $M \in \mathbb{M}$ as

$$\|M\|_{\sup} = \sup_{i,j} |M_{ij}|.$$

Define $\mathcal{S}_n$ to be the set of all mutually commuting $n$-tuples of self-adjoint operators; namely, if
$\mathbf{X} = (X_1, \ldots, X_n) \in \mathcal{S}_n$, then $[X_i, X_j] = 0$ for $i \ne j \in [n]$. We denote by $\mathcal{S}_n^d$ the set of mutually
commuting $n$-tuples of $d \times d$ Hermitian matrices.

For $A, B \in \mathbb{M}^{sa}$, $A \succeq B$ means that $A - B$ is positive semi-definite. Similarly, $A \succ B$ means $A - B$
is positive-definite.

Throughout this paper, italic capital letters (e.g. $X$) are used to denote operators, and non-italic ones
(e.g. $\mathbf{X}$) are used to denote a collection of, say $n$, operators.
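The quantities introduced in this subsection are easy to evaluate for small matrices. The sketch below is restricted, by assumption, to real symmetric $2 \times 2$ matrices, for which the eigenvalues have a closed form and the singular values are the absolute eigenvalues; the matrix `M` and all helper names are hypothetical.

```python
import math

def eig_sym2(m):
    """Eigenvalues of a real symmetric 2x2 matrix [[a, b], [b, c]]."""
    a, b, c = m[0][0], m[0][1], m[1][1]
    mid, rad = (a + c) / 2.0, math.hypot((a - c) / 2.0, b)
    return [mid - rad, mid + rad]

def schatten(m, p):
    """Schatten p-norm (2.1); singular values of a self-adjoint matrix are |eigenvalues|."""
    return sum(abs(l) ** p for l in eig_sym2(m)) ** (1.0 / p)

def normalised_trace(m):
    """tr[M] = Tr[M] / d with d = 2."""
    return (m[0][0] + m[1][1]) / 2.0

def sup_norm(m):
    """Supremum norm: largest absolute entry."""
    return max(abs(x) for row in m for x in row)

M = [[2.0, -1.0], [-1.0, 0.0]]  # hypothetical self-adjoint example
frobenius = math.sqrt(sum(x * x for row in M for x in row))

print(schatten(M, 2), frobenius)  # the Schatten 2-norm equals the Frobenius norm
print(normalised_trace(M), sup_norm(M))
```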


2.2. Matrix Calculus. In this section, we provide only the matrix calculus needed in this paper.
For a general treatment of this topic, interested readers can refer to [29, Section 2.1], [30, Chapter 17],
[31, Section X.4], [32, Section 5.3], and [33, Chapter 3].

Let $\mathcal{U}, \mathcal{W}$ be real Banach spaces. The Fréchet derivative of a function $\mathcal{L} : \mathcal{U} \to \mathcal{W}$ at a point
$X \in \mathcal{U}$, if it exists$^1$, is a unique linear mapping $D\mathcal{L}[X] : \mathcal{U} \to \mathcal{W}$ such that

$$\frac{\|\mathcal{L}(X + E) - \mathcal{L}(X) - D\mathcal{L}[X](E)\|_{\mathcal{W}}}{\|E\|_{\mathcal{U}}} \to 0 \quad \text{as } E \in \mathcal{U} \text{ and } \|E\|_{\mathcal{U}} \to 0,$$

or, equivalently,

$$\|\mathcal{L}(X + E) - \mathcal{L}(X) - D\mathcal{L}[X](E)\|_{\mathcal{W}} = o(\|E\|_{\mathcal{U}}),$$

where $\|\cdot\|_{\mathcal{U}}$ (resp. $\|\cdot\|_{\mathcal{W}}$) is a norm in $\mathcal{U}$ (resp. $\mathcal{W}$). The notation $D\mathcal{L}[X](E)$ is then interpreted as
"the Fréchet derivative of $\mathcal{L}$ at $X$ in the direction $E$". Furthermore, Fréchet differentiability implies
Gâteaux differentiability, so that differentiating $\mathcal{L}(X + tE)$ with respect to the real variable $t$ gives

$$\frac{\mathcal{L}(X + tE) - \mathcal{L}(X)}{t} \to D\mathcal{L}[X](E) \quad \text{as } t \to 0.$$

$^1$ We assume the functions considered in the paper are Fréchet differentiable. The readers can refer to, e.g. [34, 35], for
conditions under which a function is Fréchet differentiable.
For example, if the operator-valued function is the inversion $\mathcal{L}(X) = X^{-1}$ for each invertible matrix $X$,
then (see e.g. [31, Example X.4.2])

$$D\mathcal{L}[X](Y) = -X^{-1} Y X^{-1} \quad \text{for all } Y \in \mathbb{M}. \tag{2.2}$$
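Formula (2.2) can be compared against a difference quotient of the inversion map. The following is a minimal sketch for $2 \times 2$ real matrices; the matrices `X` and `Y`, the step size `t` and the helper names are assumptions chosen for illustration.

```python
def matmul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def inv2(m):
    """Inverse of an invertible 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def combine(a, b, s):
    """Return a + s * b entrywise."""
    return [[a[i][j] + s * b[i][j] for j in range(2)] for i in range(2)]

X = [[2.0, 1.0], [1.0, 3.0]]   # hypothetical invertible point
Y = [[0.5, -1.0], [2.0, 1.0]]  # hypothetical direction
t = 1e-6                       # small step for the difference quotient

X_inv = inv2(X)
diff = combine(inv2(combine(X, Y, t)), X_inv, -1.0)
finite_difference = [[diff[i][j] / t for j in range(2)] for i in range(2)]

# Right-hand side of (2.2): -X^{-1} Y X^{-1}.
exact = [[-v for v in row] for row in matmul(matmul(X_inv, Y), X_inv)]

error = max(abs(finite_difference[i][j] - exact[i][j]) for i in range(2) for j in range(2))
print(error)  # small, of the order of the step size t
```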

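The Fréchet derivative also satisfies several properties similar to conventional derivatives of real-valued
functions (see e.g. [33, Theorem 3.4]):

Proposition 2.1 (Properties of Fréchet Derivatives). Let $\mathcal{U}$, $\mathcal{V}$ and $\mathcal{W}$ be real Banach spaces.

1. (Sum Rule) If $\mathcal{L}_1 : \mathcal{U} \to \mathcal{W}$ and $\mathcal{L}_2 : \mathcal{U} \to \mathcal{W}$ are Fréchet differentiable at $A \in \mathcal{U}$, then so is
$\mathcal{L} = \alpha \mathcal{L}_1 + \beta \mathcal{L}_2$ and $D\mathcal{L}[A](E) = \alpha \cdot D\mathcal{L}_1[A](E) + \beta \cdot D\mathcal{L}_2[A](E)$.

2. (Product Rule) If $\mathcal{L}_1 : \mathcal{U} \to \mathcal{W}$ and $\mathcal{L}_2 : \mathcal{U} \to \mathcal{W}$ are Fréchet differentiable at $A \in \mathcal{U}$ and
the multiplication is well-defined in $\mathcal{W}$, then so is $\mathcal{L} = \mathcal{L}_1 \cdot \mathcal{L}_2$ and
$D\mathcal{L}[A](E) = D\mathcal{L}_1[A](E) \cdot \mathcal{L}_2(A) + \mathcal{L}_1(A) \cdot D\mathcal{L}_2[A](E)$.

3. (Chain Rule) Let $\mathcal{L}_1 : \mathcal{U} \to \mathcal{V}$ and $\mathcal{L}_2 : \mathcal{V} \to \mathcal{W}$ be Fréchet differentiable at $A \in \mathcal{U}$ and $\mathcal{L}_1(A)$
respectively, and let $\mathcal{L} = \mathcal{L}_2 \circ \mathcal{L}_1$ (i.e. $\mathcal{L}(A) = \mathcal{L}_2(\mathcal{L}_1(A))$). Then $\mathcal{L}$ is Fréchet differentiable at $A$
and $D\mathcal{L}[A](E) = D\mathcal{L}_2[\mathcal{L}_1(A)](D\mathcal{L}_1[A](E))$.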

Similarly, the $m$-th Fréchet derivative $D^m\mathcal{L}[X]$ is a unique multi-linear map from
$\mathcal{U}^m = \mathcal{U} \times \cdots \times \mathcal{U}$ ($m$ times) to $\mathcal{W}$ that satisfies

$$\|D^{m-1}\mathcal{L}[X + E_m](E_1, \ldots, E_{m-1}) - D^{m-1}\mathcal{L}[X](E_1, \ldots, E_{m-1}) - D^m\mathcal{L}[X](E_1, \ldots, E_m)\|_{\mathcal{W}} = o(\|E_m\|_{\mathcal{U}})$$

for each $E_i \in \mathcal{U}$, $i = 1, \ldots, m$. If $D^m\mathcal{L}[X]$ is continuous at $X$, then the $m$-th Fréchet derivative can be
expressed as a mixed partial derivative [36, Section 9] (see also [37, Theorem 2.3.1]):

$$D^m\mathcal{L}[X](E_1, \ldots, E_m) = \frac{\partial}{\partial s_1} \cdots \frac{\partial}{\partial s_m}\, \mathcal{L}(X + s_1 E_1 + \cdots + s_m E_m) \Big|_{s_1 = \cdots = s_m = 0}.$$

We can observe, from the above equation, that the $m$-th Fréchet derivative is symmetric about all $E_i$; see
[38, Theorem 8], [31, p. 313], and [39, Theorem 4.3.4]. We refer to Refs. [40, Section 8.12], [30, Chapter
17], [39, Section 4.3], and [41] for further information about higher order Fréchet derivatives.

The following proposition relates the second order Fréchet derivative to the convexity of a
matrix-valued function, i.e. $t\mathcal{L}(A) + (1 - t)\mathcal{L}(B) \succeq \mathcal{L}(tA + (1 - t)B)$ for all $0 \le t \le 1$.


Proposition 2.2 (Convexity of twice Fréchet differentiable matrix functions [42, Proposition 2.2]). Let
$U$ be an open convex subset of a real Banach space $\mathcal{U}$, and let $\mathcal{W}$ also be a real Banach space. Then a twice
Fréchet differentiable function $\mathcal{L} : U \to \mathcal{W}$ is convex if and only if $D^2\mathcal{L}[X](h, h) \succeq 0$ for each $X \in U$
and $h \in \mathcal{U}$.


The partial Fréchet derivative of multivariate functions can be defined as follows [32, Section 5.3]. Let
$\mathcal{U}$, $\mathcal{V}$ and $\mathcal{W}$ be real Banach spaces, and let $\mathcal{L} : \mathcal{U} \times \mathcal{V} \to \mathcal{W}$. For a fixed $v_0 \in \mathcal{V}$, $\mathcal{L}(u, v_0)$ is a function
of $u$ whose derivative at $u_0$, if it exists, is called the partial Fréchet derivative of $\mathcal{L}$ with respect to $u$,
and is denoted by $D_u\mathcal{L}[u_0, v_0]$. The partial Fréchet derivative $D_v\mathcal{L}[u_0, v_0]$ is defined similarly.

The Fréchet derivative and the partial Fréchet derivative can be related as follows.

Proposition 2.3 (Partial Fréchet derivative [32, Proposition 5.3.15]). If $\mathcal{L} : \mathcal{U} \times \mathcal{V} \to \mathcal{W}$ is Fréchet
differentiable at $(X, Y) \in \mathcal{U} \times \mathcal{V}$, then the partial Fréchet derivatives $D_X\mathcal{L}[X, Y]$ and $D_Y\mathcal{L}[X, Y]$ exist,
and

$$D\mathcal{L}[X, Y](h, k) = D_X\mathcal{L}[X, Y](h) + D_Y\mathcal{L}[X, Y](k).$$

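Now let $\mathcal{L} : \mathcal{U}^n \to \mathcal{W}$ and assume it is a holomorphic function (i.e. Fréchet differentiable in a neighborhood
of every point in its domain). Then the Taylor expansion of $\mathcal{L}(\mathbf{X} + \mathbf{E})$ at $\mathbf{X} = (X_1, \ldots, X_n) \in \mathcal{U}^n$ can be
expressed as

$$\mathcal{L}(\mathbf{X} + \mathbf{E}) = \mathcal{L}(\mathbf{X}) + \sum_{k=1}^{\infty} \frac{1}{k!} D^k\mathcal{L}[\mathbf{X}](\mathbf{E}, \ldots, \mathbf{E})$$

$$= \mathcal{L}(\mathbf{X}) + \sum_{i=1}^{n} D_{X_i}\mathcal{L}[\mathbf{X}](E_i) + \frac{1}{2!} \sum_{j=1}^{n} \sum_{k=1}^{n} D_{X_j} D_{X_k}\mathcal{L}[\mathbf{X}](E_j, E_k) + \text{Remaining terms}. \tag{2.3}$$

For any map $\mathcal{L} : \mathcal{U} \to \mathcal{W}$ and an operator $X \in \mathcal{U}$, we define the induced norm of the Fréchet derivative
$D\mathcal{L}[X]$ as

$$\|D\mathcal{L}[X]\| \triangleq \sup_{E \ne 0} \frac{\|D\mathcal{L}[X](E)\|}{\|E\|}, \tag{2.4}$$

where the norm can be any consistent norm (e.g. $\|D\mathcal{L}[X]\|_2 = \sup_{E \ne 0} \|D\mathcal{L}[X](E)\|_2 / \|E\|_2$).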

The norm of the Fréchet derivative is closely related to condition numbers, which measure the
sensitivity of an operator-valued function to perturbations in the variables. The absolute condition
number is defined by

$$\mathrm{cond}_{\mathrm{abs}}(\mathcal{L}, X) = \lim_{\epsilon \to 0} \sup_{\|E\| \le \epsilon} \frac{\|\mathcal{L}(X + E) - \mathcal{L}(X)\|}{\epsilon}. \tag{2.5}$$

Then the norm of the Fréchet derivative can be expressed by the absolute condition number [43]:

$$\mathrm{cond}_{\mathrm{abs}}(\mathcal{L}, X) = \|D\mathcal{L}[X]\|.$$

We note that there are several algorithms and software packages that can compute the absolute condition
number; see [33, Section 3], [44] and references therein.


2.3. Standard Matrix Functions. In this section, we restrict our considerations to the standard matrix 
functions. Coupled with the techniques of matrix calculus described above, standard matrix functions 
enjoy additional properties which are useful throughout this work. 

For each self-adjoint and bounded operator $A \in \mathbb{M}^{sa}$ with spectrum $\sigma(A)$ and spectral measure
$E$, the spectral decomposition is given as

$$A = \int_{\lambda \in \sigma(A)} \lambda \, dE(\lambda). \tag{2.6}$$

Hence, each scalar function can be extended to a standard matrix function as follows.

Definition 2.1 (Standard Matrix Function). Let $f : I \to \mathbb{R}$ be a real-valued function on an interval $I$ of
the real line. Suppose that $A \in \mathbb{M}^{sa}(I)$ has the spectral decomposition (2.6). Then

$$f(A) = \int_{\lambda \in \sigma(A)} f(\lambda) \, dE(\lambda).$$

From this equation, it is clear that $\sigma(f(A)) = f(\sigma(A))$, which is called the spectral mapping theorem.
Note that we use lowercase Roman and Greek letters to denote standard matrix functions, while
calligraphic capital letters such as $\mathcal{L}$ refer to general operator-valued functions that are not necessarily standard.

The spectral mapping theorem of standard matrix functions immediately yields the following convexity
property of the normalized trace function.

Lemma 2.1 (Convexity Lemma for Normalized Trace Functions). For every convex function $f : I \to \mathbb{R}$
and every matrix $A \in \mathbb{M}_d^{sa}(I)$, we have the following relation

$$\mathrm{tr}\, f(A) \ge f(\mathrm{tr}\, A).$$
Proof. The convexity lemma follows from the fact that the normalized trace is a convex combination of
the eigenvalues. More specifically,

$$\mathrm{tr}\, f(A) = \frac{1}{d} \sum_{i=1}^{d} \lambda_i(f(A)) = \frac{1}{d} \sum_{i=1}^{d} f(\lambda_i(A)) \ge f\Big( \frac{1}{d} \sum_{i=1}^{d} \lambda_i(A) \Big) = f(\mathrm{tr}\, A),$$

where in the second identity we use the spectral mapping theorem and the third relation is due to the
convexity of $f$. □

A function $f : I \to \mathbb{R}$ is called operator convex if for each $A, B \in \mathbb{M}^{sa}(I)$ and $0 \le t \le 1$,

$$t f(A) + (1 - t) f(B) \succeq f(tA + (1 - t)B).$$

Similarly, a function $f : I \to \mathbb{R}$ is called operator monotone if for each $A, B \in \mathbb{M}^{sa}(I)$,

$$A \succeq B \;\Rightarrow\; f(A) \succeq f(B).$$

It is worth noting that not all convex (resp. monotone) functions are operator convex (resp. monotone).
For example, the exponential function is neither operator convex nor operator monotone on $[0, \infty)$; the power
functions that are operator convex are $f(x) = x^p$ for $p \in [-1, 0] \cup [1, 2]$ and $f(x) = -x^p$ for $p \in [0, 1]$.
However, the trace function on $\mathbb{M}^{sa}$ given by $A \mapsto \mathrm{Tr}\,[f(A)]$ preserves convexity and monotonicity.

Proposition 2.4 (Convexity and Monotonicity for Trace Functions [45, Section 2.2]). Consider a
real-valued function $f : I \to \mathbb{R}$. If $f$ is convex (resp. monotone) on $U \subset I$, then the function $A \mapsto \mathrm{Tr}\,[f(A)]$
is convex (resp. monotone) on $\mathbb{M}^{sa}(U)$.

We refer the readers to Refs. [37] and [46] for general expositions of operator convex and monotone
functions.
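Lemma 2.1 can be sanity-checked numerically. The sketch below assumes a hypothetical real symmetric $2 \times 2$ matrix `A` and the convex function $f = \exp$; it uses the closed-form eigenvalues of a symmetric $2 \times 2$ matrix, so it illustrates the statement rather than the general proof.

```python
import math

def eig_sym2(m):
    """Eigenvalues of a real symmetric 2x2 matrix [[a, b], [b, c]]."""
    a, b, c = m[0][0], m[0][1], m[1][1]
    mid, rad = (a + c) / 2.0, math.hypot((a - c) / 2.0, b)
    return [mid - rad, mid + rad]

def normalised_trace_of_f(m, f):
    """tr f(A): by the spectral mapping theorem, the average of f over the eigenvalues."""
    return sum(f(l) for l in eig_sym2(m)) / 2.0

A = [[2.0, 1.0], [1.0, 2.0]]          # hypothetical matrix with eigenvalues 1 and 3
tr_A = (A[0][0] + A[1][1]) / 2.0      # normalized trace of A

lhs = normalised_trace_of_f(A, math.exp)  # tr f(A)
rhs = math.exp(tr_A)                      # f(tr A)
print(lhs, rhs)  # lhs is at least rhs
```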

If the scalar function is continuously differentiable, then it is convenient to introduce the following two
properties for the trace function of Fréchet derivatives.


Proposition 2.5 ([46, Theorem 3.23]). Let $A, X \in \mathbb{M}^{sa}$ and $t \in \mathbb{R}$. Assume $f : I \to \mathbb{R}$ is a continuously
differentiable function defined on an interval $I$ and assume that the eigenvalues of $A + tX$ lie in $I$. Then

$$\frac{d}{dt} \mathrm{Tr}\, f(A + tX) \Big|_{t = t_0} = \mathrm{Tr}\,[X f'(A + t_0 X)].$$
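Proposition 2.5 can be compared with a central difference quotient. The sketch below assumes $f(x) = x^3$, so that $f(A + tX) = (A + tX)^3$ and $f'(x) = 3x^2$ can be evaluated with matrix products alone; the matrices `A` and `X`, the point `t0` and the step `h` are hypothetical.

```python
def matmul(a, b):
    """Product of two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def trace(m):
    return sum(m[i][i] for i in range(len(m)))

def shifted(a, x, t):
    """Return A + t X."""
    n = len(a)
    return [[a[i][j] + t * x[i][j] for j in range(n)] for i in range(n)]

def trace_cube(a, x, t):
    """Tr f(A + tX) for f(x) = x^3."""
    m = shifted(a, x, t)
    return trace(matmul(matmul(m, m), m))

A = [[1.0, 2.0], [2.0, 0.0]]   # hypothetical self-adjoint matrix
X = [[0.0, 1.0], [1.0, 1.0]]   # hypothetical self-adjoint direction
t0, h = 0.3, 1e-5

# Left-hand side: central difference approximation of d/dt Tr f(A + tX) at t0.
numeric = (trace_cube(A, X, t0 + h) - trace_cube(A, X, t0 - h)) / (2.0 * h)

# Right-hand side: Tr[X f'(A + t0 X)] with f'(x) = 3 x^2.
M0 = shifted(A, X, t0)
exact = 3.0 * trace(matmul(X, matmul(M0, M0)))

print(numeric, exact)
```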


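Lemma 2.2. Let $A, X, Y \in \mathbb{M}^{sa}$ and $t \in \mathbb{R}$. Assume $f : I \to \mathbb{R}$ is a continuously differentiable function
defined on an interval $I$, and assume that the eigenvalues of $A + tX$ lie in $I$. Then

$$\mathrm{Tr}\,(D^2 f[A](X, Y)) = \langle X, D f'[A](Y) \rangle = \langle Y, D f'[A](X) \rangle.$$

Proof. By applying Fubini's theorem, we can interchange the order of trace and Fréchet derivative to
obtain

$$\mathrm{Tr}\,(D^2 f[A](X, Y)) = \mathrm{Tr}\Big[ \frac{\partial^2}{\partial t\, \partial s} f(A + sX + tY) \Big|_{t=0,\, s=0} \Big]$$
$$= \frac{d}{dt} \Big[ \frac{d}{ds} \mathrm{Tr}\, f(A + sX + tY) \Big|_{s=0} \Big] \Big|_{t=0}$$
$$= \frac{d}{dt} \mathrm{Tr}\,[X f'(A + tY)] \Big|_{t=0}$$
$$= \mathrm{Tr}\,[X \cdot D f'[A](Y)]$$
$$= \langle X, D f'[A](Y) \rangle,$$

where the third identity follows from Proposition 2.5.

Since $D^2 f[A]$ is symmetric in the sense that $D^2 f[A](X, Y) = D^2 f[A](Y, X)$, the second equation
follows similarly. □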

It is useful to express Fréchet derivatives in terms of divided differences. Let $a_1, a_2, \ldots$ be distinct
real points on $I \subset \mathbb{R}$. Then we define the zeroth and first order divided differences as

$$f^{[0]}(a_1) \triangleq f(a_1), \qquad f^{[1]}(a_1, a_2) \triangleq \frac{f(a_1) - f(a_2)}{a_1 - a_2},$$

and the $n$-th order divided difference as

$$f^{[n]}(a_1, a_2, \ldots, a_{n+1}) \triangleq \frac{f^{[n-1]}(a_1, a_2, \ldots, a_n) - f^{[n-1]}(a_2, a_3, \ldots, a_{n+1})}{a_1 - a_{n+1}}.$$
The following formula then gives the explicit form of the Fréchet derivative of a standard matrix
function.


Proposition 2.6 (Daleckii and Krein Formula [33, Theorem 3.11]). When $f$ is first-order differentiable
on $I \subset \mathbb{R}$ and $A = \mathrm{diag}(a_1, \ldots, a_d)$ is diagonal in $\mathbb{M}_d^{sa}(I)$, then the first-order Fréchet derivative $Df[A]$
at $A$ can be written as

$$Df[A](X) = \big[ f^{[1]}(a_i, a_j) \big]_{i,j=1}^{d} \odot X,$$

where $\odot$ denotes the Schur product. Moreover, when $f$ is second-order differentiable on $I$, the corresponding
second-order Fréchet derivative is

$$D^2 f[A](X, Y) = \Big[ \sum_{k=1}^{d} f^{[2]}(a_i, a_k, a_j)\,(X_{ik} Y_{kj} + Y_{ik} X_{kj}) \Big]_{i,j=1}^{d}.$$
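The first-order formula of Proposition 2.6 can be checked by hand for $f(x) = x^2$, whose Fréchet derivative is $Df[A](X) = AX + XA$. The sketch below assumes a hypothetical diagonal `a` and direction `X`, and uses the standard limiting convention $f^{[1]}(x, x) = f'(x)$ on the diagonal, where the two points of the divided difference coincide.

```python
def f(x):
    return x * x

def f_prime(x):
    return 2.0 * x

def divided_difference(x, y):
    """First-order divided difference, with the limiting value f'(x) when x == y."""
    if x == y:
        return f_prime(x)
    return (f(x) - f(y)) / (x - y)

a = [1.0, 3.0]                   # hypothetical diagonal entries of A
X = [[0.5, 2.0], [2.0, -1.0]]    # hypothetical direction

# Daleckii-Krein: Schur (entrywise) product of the divided-difference matrix with X.
schur_form = [[divided_difference(a[i], a[j]) * X[i][j] for j in range(2)] for i in range(2)]

# Direct derivative of A -> A^2 in direction X: A X + X A, with A = diag(a).
direct_form = [[a[i] * X[i][j] + X[i][j] * a[j] for j in range(2)] for i in range(2)]

print(schur_form)
print(direct_form)  # the two matrices coincide
```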


The standard matrix function can be extended to the multivariate case by considering $n$-tuples of
commuting self-adjoint operators.


Definition 2.2 (Multivariate Matrix Function). Let $\mathbf{X} = (X_1, \ldots, X_n) \in \mathcal{S}_n(\mathbf{I})$ be an $n$-tuple of
commuting self-adjoint operators with the spectral decompositions $X_i = \int_{\lambda \in I_i} \lambda \, dE_i(\lambda)$. Since the $X_i$'s commute,
define the product spectral measure on $\mathbf{I} = I_1 \times \cdots \times I_n$ as $E(\lambda_1, \ldots, \lambda_n) \triangleq E_1(\lambda_1) \cdots E_n(\lambda_n)$. Then,

$$f(\mathbf{X}) \triangleq \int_{(\lambda_1, \ldots, \lambda_n) \in \mathbf{I}} f(\lambda_1, \ldots, \lambda_n) \, dE(\lambda_1, \ldots, \lambda_n).$$


For each $i = 1, \ldots, n$, the (first-order) divided difference of a multivariate matrix function $f : \mathbb{R}^n \to \mathbb{R}$ is
defined by the rule

$$\varphi_i^f(x, y) = \frac{(f(x) - f(y))(x_i - y_i)}{\|x - y\|_2^2} \quad \text{for } x \ne y, \qquad \text{and} \qquad \varphi_i^f(x, x) = \frac{\partial f}{\partial x_i}(x),$$

where $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$. We will just denote it by $\varphi_i$ if no confusion is possible.

Via the divided difference function just introduced, Proposition 2.6 can be extended to multivariate matrix
functions. More precisely, for each $i = 1, \ldots, n$, the partial Fréchet derivative can be expressed as

$$D_{X_i} f[\mathbf{X}](E) = \big[ \varphi_i(\lambda_k, \lambda_l) \big]_{kl} \odot E, \quad \forall\, \mathbf{X} \in \mathcal{S}_n \text{ and } E \in \mathbb{M}^{sa}, \tag{2.7}$$

where the $\{X_i\}_{i=1}^{n}$ are simultaneously diagonalized with $\lambda_{ik}$ being the $k$-th eigenvalue of $X_i$ and
$\lambda_k = (\lambda_{1k}, \ldots, \lambda_{ik}, \ldots, \lambda_{nk})$.


Remark 2.1. To the best of our knowledge, there are four types of definitions for multivariate standard
matrix functions. Our treatment of multivariate standard matrix functions on $n$-tuples of commuting
self-adjoint operators originates from Lieb and Pedersen [47, 48] (see also [49, 50] for applications). Hansen et
al. also introduced an alternative approach by viewing $f(\mathbf{X})$ as an operator in $\mathbb{M}^{sa} \otimes \cdots \otimes \mathbb{M}^{sa}$ [42, 51-54].
Kressner [55] proposed a method by considering $f(\mathbf{X})$ as a super-operator on $\mathbb{M}^{sa} \otimes \cdots \otimes \mathbb{M}^{sa}$. Very
recently, a non-commutative generalization of the multivariate standard matrix function was proposed; see [56-59].
However, the results are still limited.
3. New Characterizations of Matrix $\Phi$-Entropy Functionals

We first introduce classical $\Phi$-entropies and their subadditivity property in Section 3.1. The readers may
refer to [8, 9, 25, 60, 61] and [5, Chapter 14] for more comprehensive discussions. Then, we move on to
introduce matrix $\Phi$-entropies in Section 3.2, and present the main result (Theorem 3.3) of this section;
namely, new characterizations of the matrix $\Phi$-entropies.

3.1. Classical $\Phi$-Entropies. Let (C1) denote the class of functions $\Phi : [0, \infty) \to \mathbb{R}$ that are continuous,
convex on $[0, \infty)$, twice differentiable on $(0, \infty)$, and either $\Phi$ is affine or $\Phi''$ is strictly positive and $1/\Phi''$
is concave.

Definition 3.1 (Classical $\Phi$-Entropies). Consider $\Phi \in$ (C1). For every non-negative integrable random
variable $Z$ such that $\mathbb{E}|Z| < \infty$ and $\mathbb{E}|\Phi(Z)| < \infty$, the classical $\Phi$-entropy $H_\Phi(Z)$ is defined as

$$H_\Phi(Z) = \mathbb{E}\,\Phi(Z) - \Phi(\mathbb{E} Z).$$

In particular, we are interested in $Z = f(X_1, \ldots, X_n)$, where $X_1, \ldots, X_n$ are independent random
variables, and $f \ge 0$ is a measurable function.

We say $H_\Phi(Z)$ is subadditive [25] if

$$H_\Phi(Z) \le \sum_{i=1}^{n} \mathbb{E}\, H_\Phi^{(i)}(Z),$$

where $H_\Phi^{(i)}(Z) = \mathbb{E}_i \Phi(Z) - \Phi(\mathbb{E}_i Z)$ is the conditional $\Phi$-entropy, and $\mathbb{E}_i$ denotes conditional expectation
conditioned on the $n - 1$ random variables $X_{-i} = (X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n)$. Sometimes we also denote
$H_\Phi^{(i)}(Z)$ by $H_\Phi(Z|X_{-i})$.

It is a well-known result that, for any function $\Phi \in$ (C1), $H_\Phi(Z)$ is subadditive [61, Corollary 3] (see
also [60, Section 3]).

The following theorem establishes equivalent characterizations of classical <J>-entropies. 

Theorem 3.1 ([9, Theorem 4.4]). The following statements are equivalent.

(a) $\Phi \in$ (C1): $\Phi$ is affine or $\Phi'' > 0$ and $1/\Phi''$ is concave;

(b) Bregman divergence $(u, v) \mapsto \Phi(u + v) - \Phi(u) - \Phi'(u)v$ is convex;

(c) $(u, v) \mapsto (\Phi'(u + v) - \Phi'(u))v$ is convex;

(d) $(u, v) \mapsto \Phi''(u)v^2$ is convex;

(e) $\Phi$ is affine or $\Phi'' > 0$ and $\Phi''''\Phi'' \ge 2\Phi'''^2$;

(f) $(u, v) \mapsto t\Phi(u) + (1 - t)\Phi(v) - \Phi(tu + (1 - t)v)$ is convex for any $0 \le t \le 1$;

(g) $\mathbb{E}_1 H_\Phi(Z \mid X_1) \ge H_\Phi(\mathbb{E}_1 Z)$;

(h) $\{H_\Phi(Z)\}_{\Phi \in \text{(C1)}}$ forms a convex set;

(i) $H_\Phi(Z) = \sup_{T > 0} \big( \mathbb{E}\,[(\Phi'(T) - \Phi'(\mathbb{E} T))(Z - T)] + H_\Phi(T) \big)$;

(j) $H_\Phi(Z) \le \sum_{i=1}^{n} \mathbb{E}\, H_\Phi^{(i)}(Z)$.
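Condition (e) is easy to probe numerically. The sketch below checks $\Phi''''\Phi'' \ge 2\Phi'''^2$ for the power functions $\Phi(x) = x^p$ on a grid; the function names and the grid are illustrative choices, not part of the original text. The condition holds for $p \in (1, 2]$ and fails for $p = 3$, in line with the membership of $x^p$, $p \in [1,2]$, in the class (C1).

```python
def derivatives_power(p, x):
    """Second, third and fourth derivatives of Phi(x) = x**p at a point x > 0."""
    d2 = p * (p - 1) * x ** (p - 2)
    d3 = d2 * (p - 2) / x
    d4 = d3 * (p - 3) / x
    return d2, d3, d4

def satisfies_condition_e(p, x, tol=1e-12):
    """Check Phi'''' * Phi'' >= 2 * (Phi''')**2 at x, up to a small tolerance."""
    d2, d3, d4 = derivatives_power(p, x)
    return d4 * d2 >= 2 * d3 ** 2 - tol

grid_x = [0.1 * k for k in range(1, 51)]

# Exponents inside (1, 2]: condition (e) should hold everywhere on the grid.
inside = all(satisfies_condition_e(p, x) for p in (1.2, 1.5, 2.0) for x in grid_x)
# Exponent 3 lies outside the class: condition (e) should fail.
outside = all(satisfies_condition_e(3.0, x) for x in grid_x)
print(inside, outside)
```

For $x^p$ the condition reduces to $(2 - p)(p - 1) \ge 0$, which explains the two outcomes.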

3.2. Matrix $\Phi$-Entropies. Chen and Tropp introduced the class of matrix $\Phi$-entropies and proved its subadditivity in 2014 [11]. In this section, we show that all equivalent characterizations of classical $\Phi$-entropies in Theorem 3.1 have a one-to-one correspondence for the class of matrix $\Phi$-entropies.

Let $d$ be a natural number. The class $\Phi_d$ contains each function $\Phi : (0, \infty) \to \mathbb{R}$ that is either affine or satisfies the following three conditions:

(1) $\Phi$ is convex and continuous at zero.

(2) $\Phi$ is twice continuously differentiable.

(3) Define $\Psi(t) = \Phi'(t)$ for $t > 0$. The Fréchet derivative $\mathsf{D}\Psi$ of the standard matrix function $\Psi : \mathbb{M}_d^{++} \to \mathbb{M}_d^{sa}$ is an invertible linear map on $\mathbb{M}_d^{sa}$, and the map $A \mapsto (\mathsf{D}\Psi[A])^{-1}$ is concave with respect to the Löwner partial ordering on positive definite matrices.

Define (C2) $:= \Phi_\infty = \bigcap_{d=1}^{\infty} \Phi_d$.
Definition 3.2 (Matrix $\Phi$-Entropies [11]). Let $\Phi \in \Phi_\infty$. Consider a random matrix $Z \in \mathbb{M}_d^{+}$ with $\mathbb{E}\|Z\|_\infty < \infty$ and $\mathbb{E}\|\Phi(Z)\|_\infty < \infty$. The matrix $\Phi$-entropy $H_\Phi(Z)$ is defined as

\[ H_\Phi(Z) = \mathrm{tr}\,[\mathbb{E}\,\Phi(Z) - \Phi(\mathbb{E} Z)] . \]

The corresponding conditional matrix $\Phi$-entropy can be defined under the $\sigma$-algebra.
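To make Definition 3.2 concrete, the sketch below evaluates the matrix $\Phi$-entropy of a finite-valued random positive definite matrix with NumPy. It assumes that $\mathrm{tr}$ denotes the normalised trace and that $\Phi$ acts as a standard matrix function (through the eigenvalues); the helper names and the random test matrices are hypothetical.

```python
import numpy as np

def matrix_function(func, a):
    """Standard matrix function: apply func to the eigenvalues of a Hermitian matrix."""
    eigvals, eigvecs = np.linalg.eigh(a)
    return (eigvecs * func(eigvals)) @ eigvecs.conj().T

def normalized_trace(a):
    # Assumption: tr is the trace divided by the dimension d.
    return np.trace(a).real / a.shape[0]

def matrix_phi_entropy(func, matrices, probs):
    """H_Phi(Z) = tr[E Phi(Z) - Phi(E Z)] for a finite-valued random matrix Z."""
    mean = sum(p * z for z, p in zip(matrices, probs))
    mean_phi = sum(p * matrix_function(func, z) for z, p in zip(matrices, probs))
    return normalized_trace(mean_phi - matrix_function(func, mean))

def xlogx(x):
    x = np.clip(x, 1e-300, None)  # guard against log(0) on tiny eigenvalues
    return x * np.log(x)

rng = np.random.default_rng(0)

def random_positive_definite(d):
    g = rng.normal(size=(d, d))
    return g @ g.T + 0.1 * np.eye(d)

# Hypothetical three-point distribution of Z on 3 x 3 positive definite matrices.
samples = [random_positive_definite(3) for _ in range(3)]
probs = [0.2, 0.5, 0.3]

h_square = matrix_phi_entropy(np.square, samples, probs)  # Phi(u) = u^2
h_xlogx = matrix_phi_entropy(xlogx, samples, probs)       # Phi(u) = u log u
print(h_square, h_xlogx)
```

For $\Phi(u) = u^2$ the value coincides with $\mathrm{tr}[\mathbb{E}Z^2 - (\mathbb{E}Z)^2]$, the matrix variance used later in Section 4.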

Theorem 3.2 (Subadditivity of Matrix $\Phi$-Entropies [11]). Let $\Phi \in$ (C2), and assume $Z$ is a measurable function of $(X_1, \ldots, X_n)$. Then

\[ H_\Phi(Z) \le \sum_{i=1}^{n} \mathbb{E}\big[ H_\Phi^{(i)}(Z) \big], \tag{3.1} \]

where $H_\Phi^{(i)}(Z) = \mathrm{tr}\,[\mathbb{E}_i \Phi(Z) - \Phi(\mathbb{E}_i Z)]$ is the conditional entropy, and $\mathbb{E}_i$ denotes conditional expectation conditioned on the $n - 1$ random matrices $X_{-i} = (X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n)$.
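The subadditivity inequality (3.1) can be checked by exact enumeration on a small product space. The sketch below takes $n = 2$, two independent binary variables, and a hypothetical table of positive definite values $Z = f(X_1, X_2)$; the names, probabilities, and the use of the normalised trace are assumptions for illustration only.

```python
import itertools
import numpy as np

def matrix_function(func, a):
    """Apply func to the eigenvalues of a Hermitian matrix."""
    eigvals, eigvecs = np.linalg.eigh(a)
    return (eigvecs * func(eigvals)) @ eigvecs.conj().T

def entropy(func, matrices, probs):
    """Matrix Phi-entropy with the normalised trace, for a finite distribution."""
    mean = sum(p * z for z, p in zip(matrices, probs))
    mean_phi = sum(p * matrix_function(func, z) for z, p in zip(matrices, probs))
    diff = mean_phi - matrix_function(func, mean)
    return np.trace(diff).real / diff.shape[0]

def xlogx(x):
    x = np.clip(x, 1e-300, None)
    return x * np.log(x)

rng = np.random.default_rng(1)
d = 3
p1 = np.array([0.3, 0.7])  # law of X_1 (hypothetical)
p2 = np.array([0.6, 0.4])  # law of X_2 (hypothetical)
z = {}
for a, b in itertools.product(range(2), range(2)):
    g = rng.normal(size=(d, d))
    z[a, b] = g @ g.T + 0.1 * np.eye(d)  # Z = f(X_1 = a, X_2 = b)

def subadditivity_gap(func):
    """Right-hand side of (3.1) minus its left-hand side."""
    pairs = list(itertools.product(range(2), range(2)))
    joint = entropy(func, [z[a, b] for a, b in pairs], [p1[a] * p2[b] for a, b in pairs])
    # Entropy over X_1 with X_2 frozen, then averaged over X_2 (and vice versa).
    over_x1 = sum(p2[b] * entropy(func, [z[0, b], z[1, b]], p1) for b in range(2))
    over_x2 = sum(p1[a] * entropy(func, [z[a, 0], z[a, 1]], p2) for a in range(2))
    return over_x1 + over_x2 - joint

gap_square = subadditivity_gap(np.square)
gap_xlogx = subadditivity_gap(xlogx)
print(gap_square, gap_xlogx)
```

A non-negative gap is what Theorem 3.2 predicts for $\Phi(u) = u^2$ and $\Phi(u) = u \log u$.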

The following theorem is the main result of the section. It establishes that all the equivalent conditions in Theorem 3.1 also hold for the class of matrix $\Phi$-entropies.


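Theorem 3.3. The following statements are equivalent.

(a) $\Phi \in$ (C2): $\Phi$ is affine or $\mathsf{D}\Psi$ is invertible and $A \mapsto (\mathsf{D}\Psi[A])^{-1}$ is operator concave;

(b) Matrix Bregman divergence: $(A, B) \mapsto \mathrm{Tr}[\Phi(A + B) - \Phi(A) - \mathsf{D}\Phi[A](B)]$ is convex;

(c) $(A, B) \mapsto \mathrm{Tr}[\mathsf{D}\Phi[A + B](B) - \mathsf{D}\Phi[A](B)]$ is convex;

(d) $(A, B) \mapsto \mathrm{Tr}[\mathsf{D}^2\Phi[A](B, B)]$ is convex;

(e) $\Phi$ is affine or $\Phi'' > 0$ and

\[ \mathrm{Tr}\Big[ h \cdot (\mathsf{D}\Psi[A])^{-1} \circ \mathsf{D}^3\Psi[A]\big(k, k, (\mathsf{D}\Psi[A])^{-1}(h)\big) \Big] \ge 2\, \mathrm{Tr}\Big[ h \cdot (\mathsf{D}\Psi[A])^{-1} \circ \mathsf{D}^2\Psi[A]\Big(k, (\mathsf{D}\Psi[A])^{-1}\big( \mathsf{D}^2\Psi[A]\big(k, (\mathsf{D}\Psi[A])^{-1}(h)\big) \big) \Big) \Big] \tag{3.2} \]

for each $A \succ 0$ and $h, k \in \mathbb{M}_d^{sa}$;

(f) $(A, B) \mapsto \mathrm{Tr}[t\Phi(A) + (1 - t)\Phi(B) - \Phi(tA + (1 - t)B)]$ is convex for any $0 \le t \le 1$;

(g) $\mathbb{E}_1 H_\Phi(Z \mid X_1) \ge H_\Phi(\mathbb{E}_1 Z)$;

(h) $\{H_\Phi(Z)\}_{\Phi \in \text{(C2)}}$ forms a convex set of convex functions;

(i) $H_\Phi(Z) = \sup_{T \succ 0} \big( \mathrm{tr}\, \mathbb{E}\,[(\Phi'(T) - \Phi'(\mathbb{E} T))(Z - T)] + H_\Phi(T) \big)$;

(j) $H_\Phi(Z) \le \sum_{i=1}^{n} \mathbb{E}\, H_\Phi^{(i)}(Z)$.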


Proof. (a) $\Rightarrow$ (i) $\Rightarrow$ (g) $\Rightarrow$ (j): This statement is proved by Chen and Tropp in [11].

(a) $\Leftrightarrow$ (b): This equivalence is proved in [13, Theorem 2].

(a) $\Leftrightarrow$ (d): Theorem 2.1 in [12] proved the equivalence of (a) and the following convexity lemma.

Lemma 3.1 (Convexity Lemma [11, Lemma 4.2]). Fix a function $\Phi \in \Phi_\infty$, and let $\Psi = \Phi'$. Suppose that $A$ is a random matrix taking values in $\mathbb{M}_d^{++}$, and let $X$ be a random matrix taking values in $\mathbb{M}_d^{sa}$. Assume that $\|A\|$, $\|X\|$ are integrable. Then

\[ \mathbb{E}\, \langle X, \mathsf{D}\Psi[A](X) \rangle \ge \langle \mathbb{E}[X], \mathsf{D}\Psi[\mathbb{E} A](\mathbb{E} X) \rangle . \]

What remains is to establish equivalence between the convexity lemma and condition (d). This follows easily from Lemma 2.2:

\[ \mathrm{Tr}\big( \mathsf{D}^2\Phi[A](X, X) \big) = \langle X, \mathsf{D}\Phi'[A](X) \rangle . \tag{3.3} \]

Remark 3.1. In Ref. [11, Lemma 4.2], it is shown that the concavity of the map

\[ A \mapsto \langle X, (\mathsf{D}\Psi[A])^{-1}(X) \rangle, \quad \forall X \in \mathbb{M}_d^{sa}, \]

implies the joint convexity of the map (i.e. Lemma 3.1)

\[ (X, A) \mapsto \langle X, \mathsf{D}\Psi[A](X) \rangle . \tag{3.4} \]

In Ref. [12, Theorem 2.1], the author provided another proof for the above implication (i.e. Lemma 3.1). It is worth mentioning that the proof in [12, Theorem 2.1] originated from [46, Theorem 4.21]. Moreover, the converse direction of Lemma 3.1 follows simply from the argument in [46, Theorem 4.21]. It is also noted in [12] that the joint convexity of Eq. (3.4) implies that the entropy class $\Phi_\infty$ is a convex set. Therefore, the set of matrix $\Phi$-entropy functionals forms a convex cone.

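(b) $\Leftrightarrow$ (c) $\Leftrightarrow$ (d): Define the real-valued functions $A_\Phi$, $B_\Phi$, $C_\Phi$ of a pair of matrices $(u, v)$ as

\[ \begin{aligned} A_\Phi(u, v) &= \mathrm{Tr}[\Phi(u + v) - \Phi(u) - \mathsf{D}\Phi[u](v)], \\ B_\Phi(u, v) &= \mathrm{Tr}[\mathsf{D}\Phi[u + v](v) - \mathsf{D}\Phi[u](v)], \\ C_\Phi(u, v) &= \mathrm{Tr}[\mathsf{D}^2\Phi[u](v, v)]. \end{aligned} \]

Following [9], we can establish the following relations: for any admissible pair $(u, v)$,

\[ A_\Phi(u, v) = \int_0^1 (1 - s)\, C_\Phi(u + sv, v)\, \mathrm{d}s, \tag{3.5} \]

\[ B_\Phi(u, v) = \int_0^1 C_\Phi(u + sv, v)\, \mathrm{d}s, \tag{3.6} \]

and for small enough $\epsilon > 0$,

\[ A_\Phi(u, \epsilon v) = \tfrac{1}{2} C_\Phi(u, v)\, \epsilon^2 + o(\epsilon^2); \tag{3.7} \]

\[ B_\Phi(u, \epsilon v) = C_\Phi(u, v)\, \epsilon^2 + o(\epsilon^2). \tag{3.8} \]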

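Eq. (3.5) is exactly the integral representation for the matrix Bregman divergence proved in [13]. Similarly, Eq. (3.6) follows from

\[ \begin{aligned} B_\Phi(u, v) &= \frac{\mathrm{d}}{\mathrm{d}s} \mathrm{Tr}[\Phi(u + sv)] \Big|_{s=1} - \frac{\mathrm{d}}{\mathrm{d}s} \mathrm{Tr}[\Phi(u + sv)] \Big|_{s=0} \\ &= \int_0^1 \frac{\mathrm{d}}{\mathrm{d}s} \Big( \frac{\mathrm{d}}{\mathrm{d}s} \mathrm{Tr}[\Phi(u + sv)] \Big) \mathrm{d}s \\ &= \int_0^1 C_\Phi(u + sv, v)\, \mathrm{d}s . \end{aligned} \]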
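In the scalar case $d = 1$ the relations (3.5) to (3.7) reduce to elementary calculus identities, which the following sketch verifies numerically for $\Phi(x) = x \log x$. The midpoint quadrature, the chosen point $(u, v)$, and all function names are assumptions made for the illustration.

```python
import math

def phi(x):
    return x * math.log(x)

def dphi(x):
    return math.log(x) + 1.0

def d2phi(x):
    return 1.0 / x

def bregman(u, v):       # scalar A_Phi(u, v)
    return phi(u + v) - phi(u) - dphi(u) * v

def increment(u, v):     # scalar B_Phi(u, v)
    return (dphi(u + v) - dphi(u)) * v

def curvature(u, v):     # scalar C_Phi(u, v)
    return d2phi(u) * v * v

def midpoint_integral(g, n=20000):
    """Midpoint rule on [0, 1]; the quadrature error is O(1/n^2)."""
    h = 1.0 / n
    return h * sum(g((k + 0.5) * h) for k in range(n))

u, v = 1.3, 0.8
a_integral = midpoint_integral(lambda s: (1 - s) * curvature(u + s * v, v))  # Eq. (3.5)
b_integral = midpoint_integral(lambda s: curvature(u + s * v, v))            # Eq. (3.6)

eps = 1e-3
second_order_ratio = bregman(u, eps * v) / (0.5 * curvature(u, v) * eps ** 2)  # Eq. (3.7)
print(a_integral - bregman(u, v), b_integral - increment(u, v), second_order_ratio)
```

The two differences should be close to zero and the ratio close to one, matching the integral representations and the second-order expansion.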

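Eqs. (3.7) and (3.8) can be obtained by Taylor expansion at $(u, 0)$. That is,

\[ \begin{aligned} A_\Phi(u, \epsilon v) &= A_\Phi(u, 0) + \mathsf{D}_u A_\Phi[u, 0](0) + \mathsf{D}_v A_\Phi[u, 0](\epsilon v) \\ &\quad + \tfrac{1}{2} \big( \mathsf{D}_u^2 A_\Phi[u, 0](0, 0) + 2 \mathsf{D}_u \mathsf{D}_v A_\Phi[u, 0](0, \epsilon v) + \mathsf{D}_v^2 A_\Phi[u, 0](\epsilon v, \epsilon v) \big) + o(\epsilon^2) \\ &= \mathrm{Tr}\Big[ \mathsf{D}\Phi[u + 0](\epsilon v) - \mathsf{D}\Phi[u](\epsilon v) + \tfrac{1}{2} \mathsf{D}^2\Phi[u + 0](\epsilon v, \epsilon v) \Big] + o(\epsilon^2) \\ &= \tfrac{1}{2} C_\Phi(u, v)\, \epsilon^2 + o(\epsilon^2). \end{aligned} \]

Following the same argument,

\[ \begin{aligned} B_\Phi(u, \epsilon v) &= B_\Phi(u, 0) + \mathrm{Tr}\big[ \mathsf{D}^2\Phi[u + 0](0, \epsilon v) + \mathsf{D}\Phi[u + 0](\epsilon v) - \mathsf{D}\Phi[u](\epsilon v) \big] \\ &\quad + \tfrac{1}{2} \mathrm{Tr}\big[ \mathsf{D}^3\Phi[u + 0](0, \epsilon v, \epsilon v) + 2 \mathsf{D}^2\Phi[u + 0](\epsilon v, \epsilon v) \big] + o(\epsilon^2) \\ &= C_\Phi(u, v)\, \epsilon^2 + o(\epsilon^2). \end{aligned} \]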


We can observe from Eqs. (3.5) and (3.6) that the joint convexity of $(u, v) \mapsto A_\Phi(u, v)$ and $(u, v) \mapsto B_\Phi(u, v)$ follows from that of $(u, v) \mapsto C_\Phi(u, v)$. In other words, we have proved that (d) $\Rightarrow$ (b) and (d) $\Rightarrow$ (c).

Conversely, Eqs. (3.7) and (3.8) show that (b) $\Rightarrow$ (d) and (c) $\Rightarrow$ (d). To be more specific, the joint convexity of $(u, v) \mapsto A_\Phi(u, \epsilon v)$ implies

\[ t A_\Phi(u_1, \epsilon v_1) + (1 - t) A_\Phi(u_2, \epsilon v_2) \ge A_\Phi(u, \epsilon v), \tag{3.9} \]

for each $u_1, u_2, v_1, v_2$, $t \in [0, 1]$, $\epsilon > 0$, and $u = t u_1 + (1 - t) u_2$, $v = t v_1 + (1 - t) v_2$. Invoking Eq. (3.7) gives

\[ t A_\Phi(u_1, \epsilon v_1) + (1 - t) A_\Phi(u_2, \epsilon v_2) = \frac{t C_\Phi(u_1, v_1) + (1 - t) C_\Phi(u_2, v_2)}{2}\, \epsilon^2 + o(\epsilon^2), \]

and

\[ A_\Phi(u, \epsilon v) = \tfrac{1}{2} C_\Phi(u, v)\, \epsilon^2 + o(\epsilon^2). \]

Hence, Eq. (3.9) is equivalent to

\[ t C_\Phi(u_1, v_1)\, \epsilon^2 + (1 - t) C_\Phi(u_2, v_2)\, \epsilon^2 + o(\epsilon^2) \ge C_\Phi(u, v)\, \epsilon^2 + o(\epsilon^2). \]

The joint convexity of $(u, v) \mapsto C_\Phi(u, v)$ follows by dividing by $\epsilon^2$ on both sides and letting $\epsilon \to 0$. The same conclusion can be obtained from the joint convexity of $(u, v) \mapsto B_\Phi(u, \epsilon v)$ in a similar way using Eq. (3.8).

(a) $\Leftrightarrow$ (e): It is trivial if $\Phi$ is affine; hence we assume $\Phi'' > 0$. We start from the convexity of the map:


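\[ A \mapsto -\mathrm{Tr}\big[ h\, (\mathsf{D}\Psi[A])^{-1}(h) \big], \quad \text{for all } h \in \mathbb{M}_d^{sa}. \tag{3.10} \]

To ease the burden of notation, we denote $\mathsf{T}_A \equiv \mathsf{D}\Psi[A] \in \mathbb{C}^{d^2 \times d^2}$ and $\mathbf{h} \equiv h \in \mathbb{C}^{d^2 \times 1}$ by the isometric isomorphism between super-operators and matrices $^2$. Then Eq. (3.10) can be rewritten as

\[ A \mapsto -\mathbf{h}^\dagger \cdot \mathsf{T}_A^{-1} \cdot \mathbf{h}, \quad \text{for all } \mathbf{h} \in \mathbb{C}^{d^2 \times 1}, \]

which is equivalent to the non-negativity of the second derivative (see Proposition 2.2):

\[ \mathsf{D}_A^2 \big[ -\mathbf{h}^\dagger \cdot \mathsf{T}_A^{-1} \cdot \mathbf{h} \big](k, k) = -\mathbf{h}^\dagger \cdot \mathsf{D}_A^2 \big[ \mathsf{T}_A^{-1} \big](k, k) \cdot \mathbf{h} \ge 0, \quad \text{for all } A \succ 0,\ \mathbf{h} \in \mathbb{C}^{d^2 \times 1},\ k \in \mathbb{M}_d^{sa}. \]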


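Now, recall the chain rule of the Fréchet derivative in Proposition 2.1:

\[ \begin{aligned} \mathsf{D}(\mathcal{F} \circ \mathcal{G})[A](u) &= \mathsf{D}\mathcal{F}[\mathcal{G}(A)]\big( \mathsf{D}\mathcal{G}[A](u) \big); \\ \mathsf{D}^2(\mathcal{F} \circ \mathcal{G})[A](u, v) &= \mathsf{D}^2\mathcal{F}[\mathcal{G}(A)]\big( \mathsf{D}\mathcal{G}[A](u), \mathsf{D}\mathcal{G}[A](v) \big) + \mathsf{D}\mathcal{F}[\mathcal{G}(A)]\big( \mathsf{D}^2\mathcal{G}[A](u, v) \big), \end{aligned} \]

and the formula for the differentiation of the inverse function (see Lemma A.1):

\[ \begin{aligned} \mathsf{D}\mathcal{G}[A]^{-1}(u) &= -\mathcal{G}(A)^{-1} \cdot \mathsf{D}\mathcal{G}[A](u) \cdot \mathcal{G}(A)^{-1}; \\ \mathsf{D}^2\mathcal{G}[A]^{-1}(u, u) &= 2\, \mathcal{G}(A)^{-1} \cdot \mathsf{D}\mathcal{G}[A](u) \cdot \mathcal{G}(A)^{-1} \cdot \mathsf{D}\mathcal{G}[A](u) \cdot \mathcal{G}(A)^{-1} - \mathcal{G}(A)^{-1} \cdot \mathsf{D}^2\mathcal{G}[A](u, u) \cdot \mathcal{G}(A)^{-1}. \end{aligned} \]

Taking $\mathcal{G}[A] = \mathsf{T}_A$ and $u = k$, we can compute the following identities:

\[ \mathsf{D}_A \big[ \mathsf{T}_A^{-1} \big](k) = -\mathsf{T}_A^{-1} \cdot \mathsf{D}_A[\mathsf{T}_A](k) \cdot \mathsf{T}_A^{-1}, \]

and

\[ \mathsf{D}_A^2 \big[ \mathsf{T}_A^{-1} \big](k, k) = 2\, \mathsf{T}_A^{-1} \cdot \mathsf{D}_A[\mathsf{T}_A](k) \cdot \mathsf{T}_A^{-1} \cdot \mathsf{D}_A[\mathsf{T}_A](k) \cdot \mathsf{T}_A^{-1} - \mathsf{T}_A^{-1} \cdot \mathsf{D}_A^2[\mathsf{T}_A](k, k) \cdot \mathsf{T}_A^{-1}. \]

Therefore, we reach the expression (3.2), and statement (a) is true if and only if (3.2) holds.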
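The differentiation rule for the inverse can be sanity-checked with finite differences. The sketch below takes the simplest case $\mathcal{G}(A) = A$ (so that $\mathsf{D}\mathcal{G}[A](u) = u$ and $\mathsf{D}^2\mathcal{G} = 0$); the test matrices, step size, and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4

# A well-conditioned positive definite base point and a symmetric direction.
g = rng.normal(size=(d, d))
a = g @ g.T + d * np.eye(d)
u = rng.normal(size=(d, d))
u = (u + u.T) / 2

inv = np.linalg.inv
a_inv = inv(a)

# Closed forms for G(A) = A: DG[A](u) = u and D^2 G[A](u, u) = 0.
first_formula = -a_inv @ u @ a_inv
second_formula = 2 * a_inv @ u @ a_inv @ u @ a_inv

# Central finite differences of t -> (A + t u)^{-1} at t = 0.
eps = 1e-4
first_fd = (inv(a + eps * u) - inv(a - eps * u)) / (2 * eps)
second_fd = (inv(a + eps * u) - 2 * a_inv + inv(a - eps * u)) / eps ** 2

print(np.abs(first_fd - first_formula).max(), np.abs(second_fd - second_formula).max())
```

Both discrepancies should be at the level of the finite-difference error, consistent with the two displayed formulas.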


$^2$ Some authors refer to $\mathsf{T}_A$ as the Liouville super-operator and call $\mathbf{h}$ the vectorisation of $h$; see e.g. [62, Sec. II].
Recall that in the scalar case (i.e. $d = 1$), the Fréchet derivative can be expressed as the product of the differential and the direction (see Proposition 2.6):

\[ \mathsf{D}\Psi[a]h = \Psi'(a) \cdot h. \]

Hence, Eq. (3.2) reduces to

\[ \begin{aligned} h \cdot (\Psi'(a))^{-1} \cdot \Psi'''(a) \cdot k^2 \cdot (\Psi'(a))^{-1} \cdot h &= \frac{\Phi''''(a) \cdot \Phi''(a) \cdot k^2 h^2}{\Phi''(a)^3} \\ &\ge 2 \cdot h \cdot (\Psi'(a))^{-1} \cdot \Psi''(a) \cdot k \cdot (\Psi'(a))^{-1} \cdot \Psi''(a) \cdot k \cdot (\Psi'(a))^{-1} \cdot h = \frac{2\, \Phi'''(a)^2 \cdot k^2 h^2}{\Phi''(a)^3}, \end{aligned} \]

for all $a > 0$ and $h, k \in \mathbb{R}$. In other words, Eq. (3.2) can be viewed as a non-commutative generalisation of the classical statement $\Phi''''\Phi'' \ge 2\Phi'''^2$.

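(d) $\Leftrightarrow$ (f): For any $t \in [0, 1]$, define $F_t : \mathbb{M}_d^{+} \times \mathbb{M}_d^{+} \to \mathbb{M}_d^{sa}$ as

\[ F_t(X, Y) = t\Phi(X) + (1 - t)\Phi(Y) - \Phi(tX + (1 - t)Y). \]

By taking $\mathbf{x} = (X, Y)$ and $\mathbf{h} = (h, k)$ in Proposition 2.2, the convexity of the twice Fréchet differentiable function $F_t$ is equivalent to

\[ \mathsf{D}^2 F_t[X, Y](h, k) \succeq 0 \quad \forall X, Y \in \mathbb{M}_d^{+} \text{ and } \forall h, k \in \mathbb{M}_d^{sa}. \]

Then, with the help of the partial Fréchet derivative defined in Proposition 2.3, the second-order Fréchet derivative of $F_t(X, Y)$ can be evaluated as

\[ \begin{aligned} \mathsf{D}^2 F_t[X, Y](h, k) &= \mathsf{D}_X^2 F_t[X, Y](h, h) + \mathsf{D}_Y \mathsf{D}_X F_t[X, Y](h, k) + \mathsf{D}_X \mathsf{D}_Y F_t[X, Y](k, h) + \mathsf{D}_Y^2 F_t[X, Y](k, k) \\ &= t \cdot \mathsf{D}^2\Phi[X](h, h) - t^2 \cdot \mathsf{D}^2\Phi[tX + (1 - t)Y](h, h) \\ &\quad - t(1 - t) \cdot \mathsf{D}^2\Phi[tX + (1 - t)Y](h, k) - t(1 - t) \cdot \mathsf{D}^2\Phi[tX + (1 - t)Y](k, h) \\ &\quad + (1 - t) \cdot \mathsf{D}^2\Phi[Y](k, k) - (1 - t)^2 \cdot \mathsf{D}^2\Phi[tX + (1 - t)Y](k, k). \end{aligned} \tag{3.11} \]

Taking the trace on both sides of (3.11) and invoking Lemma 2.2, we have

\[ \begin{aligned} \mathrm{Tr}\big[ \mathsf{D}^2 F_t[X, Y](h, k) \big] &= \mathrm{Tr}\big[ t \cdot h\, \mathsf{D}\Psi[X](h) - t^2 \cdot h\, \mathsf{D}\Psi[tX + (1 - t)Y](h) \big] \\ &\quad - \mathrm{Tr}\big[ t(1 - t) \cdot h\, \mathsf{D}\Psi[tX + (1 - t)Y](k) + t(1 - t) \cdot k\, \mathsf{D}\Psi[tX + (1 - t)Y](h) \big] \\ &\quad + \mathrm{Tr}\big[ (1 - t) \cdot k\, \mathsf{D}\Psi[Y](k) - (1 - t)^2 \cdot k\, \mathsf{D}\Psi[tX + (1 - t)Y](k) \big]. \end{aligned} \tag{3.12} \]

Since both the trace and the second-order Fréchet derivative are bilinear, we have the following result:

\[ \begin{aligned} \mathrm{Tr}\big[ t^2 \cdot h\, \mathsf{D}\Psi[tX + (1 - t)Y](h) + t(1 - t) \cdot k\, \mathsf{D}\Psi[tX + (1 - t)Y](h) \big] &= \langle th, \mathsf{D}\Psi[tX + (1 - t)Y](th) \rangle + \langle (1 - t)k, \mathsf{D}\Psi[tX + (1 - t)Y](th) \rangle \\ &= \langle th + (1 - t)k, \mathsf{D}\Psi[tX + (1 - t)Y](th) \rangle. \end{aligned} \tag{3.13} \]

Similarly,

\[ \mathrm{Tr}\big[ t(1 - t) \cdot h\, \mathsf{D}\Psi[tX + (1 - t)Y](k) + (1 - t)^2 \cdot k\, \mathsf{D}\Psi[tX + (1 - t)Y](k) \big] = \langle th + (1 - t)k, \mathsf{D}\Psi[tX + (1 - t)Y]((1 - t)k) \rangle. \tag{3.14} \]

Combining Eqs. (3.13) and (3.14), Eq. (3.12) can be expressed as

\[ \mathrm{Tr}\big[ \mathsf{D}^2 F_t[X, Y](h, k) \big] = t \cdot \langle h, \mathsf{D}\Psi[X](h) \rangle + (1 - t) \cdot \langle k, \mathsf{D}\Psi[Y](k) \rangle - \langle th + (1 - t)k, \mathsf{D}\Psi[tX + (1 - t)Y](th + (1 - t)k) \rangle. \]

Then, it is not hard to observe that the non-negativity of $\mathrm{Tr}[\mathsf{D}^2 F_t[X, Y](h, k)]$ for every $X, Y \in \mathbb{M}_d^{+}$, $h, k \in \mathbb{M}_d^{sa}$, and $t \in [0, 1]$ is equivalent to the joint convexity of the map

\[ (X, A) \mapsto \langle X, \mathsf{D}\Psi[A](X) \rangle = \mathrm{Tr}\big[ \mathsf{D}^2\Phi[A](X, X) \big]. \]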

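(j) $\Rightarrow$ (g): Considering $n = 2$, the subadditivity means that

\[ H_\Phi(Z) \le \mathbb{E}_1 H_\Phi^{(2)}(Z) + \mathbb{E}_2 H_\Phi^{(1)}(Z). \]

Then, we have

\[ \begin{aligned} \mathbb{E}_1 H_\Phi^{(2)}(Z) &\ge H_\Phi(Z) - \mathbb{E}_2 H_\Phi^{(1)}(Z) \\ &= \mathrm{tr}\big[ \mathbb{E}\Phi(Z) - \Phi(\mathbb{E} Z) - \mathbb{E}_2 \mathbb{E}_1 \Phi(Z) + \mathbb{E}_2 \Phi(\mathbb{E}_1 Z) \big] \\ &= \mathrm{tr}\big[ \mathbb{E}_2 \Phi(\mathbb{E}_1 Z) - \Phi(\mathbb{E}_2 \mathbb{E}_1 Z) \big] \\ &= H_\Phi(\mathbb{E}_1 Z). \end{aligned} \]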

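(f) $\Leftrightarrow$ (h): Let $s \in [0, 1]$, and define a pair of positive semi-definite random matrices $(X, Y)$ taking the value $(x, y)$ with probability $s$ and $(x', y')$ with probability $(1 - s)$. Then the convexity of $H_\Phi$ implies that

\[ H_\Phi(tX + (1 - t)Y) \le t H_\Phi(X) + (1 - t) H_\Phi(Y) \tag{3.15} \]

for every $t \in [0, 1]$. Now define $F_t(u, v) : \mathbb{M}_d^{+} \times \mathbb{M}_d^{+} \to \mathbb{R}$ as

\[ F_t(u, v) = \mathrm{Tr}\,[t\Phi(u) + (1 - t)\Phi(v) - \Phi(tu + (1 - t)v)] . \]

Then, since $\mathrm{Tr} = d \cdot \mathrm{tr}$, it follows that

\[ \begin{aligned} & s F_t(x, y) + (1 - s) F_t(x', y') - F_t\big( s(x, y) + (1 - s)(x', y') \big) \\ &\quad = d \cdot \mathrm{tr}\big[ t\, \mathbb{E}\Phi(X) - t\Phi(\mathbb{E} X) + (1 - t)\mathbb{E}\Phi(Y) - (1 - t)\Phi(\mathbb{E} Y) - \mathbb{E}\Phi(tX + (1 - t)Y) + \Phi(t\mathbb{E} X + (1 - t)\mathbb{E} Y) \big] \\ &\quad = d \cdot \big( t H_\Phi(X) + (1 - t) H_\Phi(Y) - H_\Phi(tX + (1 - t)Y) \big), \end{aligned} \]

which means that the convexity of the map $(u, v) \mapsto F_t(u, v)$ is equivalent to the convexity of $H_\Phi$, i.e. Eq. (3.15).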

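(g) $\Leftrightarrow$ (h): Define a positive semi-definite random matrix $Z = f(X_1, X_2)$, which depends on two random variables $X_1, X_2$ on a Polish space. Denote by $Z_{X_1}$ the random matrix $Z$ conditioned on $X_1$. According to the convexity of $H_\Phi$, it follows that

\[ \begin{aligned} \mathbb{E}_1 H_\Phi(Z \mid X_1) &= \mathbb{E}_1 H_\Phi(Z_{X_1}) \\ &= \mathbb{E}_1\, \mathrm{tr}\big( \mathbb{E}_2 \Phi(Z_{X_1}) - \Phi(\mathbb{E}_2 Z_{X_1}) \big) \\ &\ge \mathrm{tr}\, \mathbb{E}_2 \Phi(\mathbb{E}_1 Z_{X_1}) - \mathrm{tr}\,[\Phi(\mathbb{E}_1 \mathbb{E}_2 Z_{X_1})] \\ &= H_\Phi(\mathbb{E}_1 Z). \end{aligned} \]

Conversely, define a positive semi-definite random matrix $Z(s, X, Y) = sX + (1 - s)Y$, where $s$ is a random variable. Now let $s$ be Bernoulli distributed with parameter $t \in [0, 1]$. Then for all $t \in [0, 1]$, the inequality $\mathbb{E}_1 H_\Phi(Z \mid s) \ge H_\Phi(\mathbb{E}_1 Z)$ coincides with

\[ H_\Phi(tX + (1 - t)Y) \le t H_\Phi(X) + (1 - t) H_\Phi(Y). \]

□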
4. Matrix Concentration Inequalities

The main results in this section include various matrix concentration inequalities for matrix $\Phi$-entropies. We derive matrix Poincaré inequalities for general multivariate super-operators $\mathcal{L} : (\mathbb{M}_d^{sa})^n \to \mathbb{M}_d^{sa}$ (Theorem 4.2), multivariate standard matrix functions (Corollary 4.1), and Lipschitz functions (Corollary 4.2) in Section 4.1. We then extend the matrix Poincaré inequality to the Gaussian distribution (called the matrix Gaussian Poincaré inequality, Theorem 4.3). These results rely on the matrix Efron-Stein inequality, which was first proved in [28, Theorem 4.2]. Here we re-derive a special case in Theorem 4.1 using the subadditivity property of matrix $\Phi$-entropies. Its proof is, arguably, simpler.

Section 4.2 presents the results on Sobolev inequalities for matrix $\Phi$-entropies. The matrix $\Phi$-Sobolev inequality for symmetric Bernoulli random variables and that for Gaussian random variables are in Theorem 4.5 and Theorem 4.6, respectively. The matrix logarithmic Sobolev inequality for symmetric Bernoulli random variables and that for Gaussian random variables are in Corollaries 4.3 and 4.4, respectively.

Throughout this section, let $\mathbf{X} = (X_1, \ldots, X_n)$ be a sequence of independent random variables taking values in some Polish space and let $Z = \mathcal{L}(\mathbf{X}) \in \mathbb{M}_d^{sa}$ be a random matrix such that $\|\mathbb{E} Z^2\|_\infty < \infty$. Let $X_i'$ be an independent copy of $X_i$, for $1 \le i \le n$, and denote $\mathbf{X}^{(i)} = (X_1, \ldots, X_{i-1}, X_i', X_{i+1}, \ldots, X_n)$, i.e. replacing the $i$-th component of $\mathbf{X}$ by the independent copy $X_i'$. Let $\mathbf{X}_{-i} = (X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n)$ and $\mathbb{E}_i[\,\cdot\,] = \mathbb{E}[\,\cdot \mid \mathbf{X}_{-i}]$, i.e. expectation with respect to the $i$-th variable. Finally, denote $Z_i' = \mathcal{L}(\mathbf{X}^{(i)})$ for every $i = 1, \ldots, n$.


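4.1. The Matrix Poincaré Inequality. Define the quantity

\[ \mathcal{E}(\mathcal{L}) := \frac{1}{2}\, \mathrm{tr}\, \mathbb{E}\Big[ \sum_{i=1}^{n} \big( \mathcal{L}(\mathbf{X}) - \mathcal{L}(\mathbf{X}^{(i)}) \big)^2 \Big], \]

and we use the notations $\mathcal{E}(\mathcal{L})$ and $\mathcal{E}(Z)$ interchangeably in the following. The quantity $\mathcal{E}(Z)$ has the following equivalent expressions (Lemma A.2):

\[ \mathcal{E}(Z) = \frac{1}{2} \sum_{i=1}^{n} \mathrm{tr}\, \mathbb{E}\big[ (Z - Z_i')^2 \big] = \sum_{i=1}^{n} \mathrm{tr}\, \mathbb{E}\big[ (Z - \mathbb{E}_i Z)^2 \big] = \sum_{i=1}^{n} \mathrm{tr}\, \mathbb{E}\big[ (Z - Z_i')_+^2 \big], \tag{4.1} \]

where for $A \in \mathbb{M}^{sa}$ with eigen-decomposition $A = \sum_i \lambda_i |i\rangle\langle i|$, $(A)_+ = \sum_{i : \lambda_i > 0} \lambda_i |i\rangle\langle i|$ denotes the contribution from its positive eigenvalues. Denote the (real-valued) variance of a random matrix $A$ (taking values in $\mathbb{M}_d^{sa}$) by

\[ \mathrm{Var}(A) = \mathrm{tr}\, \mathbb{E}\big[ (A - \mathbb{E} A)^2 \big] = \mathrm{tr}\big[ \mathbb{E} A^2 - (\mathbb{E} A)^2 \big]. \]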

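Theorem 4.1 (Matrix Efron-Stein Inequality). With the prevailing assumptions, we have

\[ \mathrm{Var}(Z) \le \mathcal{E}(Z) = \frac{1}{2} \sum_{i=1}^{n} \mathrm{tr}\, \mathbb{E}\big[ (Z - Z_i')^2 \big]. \tag{4.2} \]

Proof. Denote $\mathrm{Var}^{(i)}(Z) = \mathrm{tr}\, \mathbb{E}_i (Z - \mathbb{E}_i Z)^2$. Since $Z_i'$ is an independent copy of $Z$ conditioned on $\mathbf{X}_{-i}$, Lemma A.2 yields

\[ \mathrm{Var}^{(i)}(Z) = \frac{1}{2}\, \mathrm{tr}\, \mathbb{E}_i\big[ (Z - Z_i')^2 \big]. \tag{4.3} \]

With the equivalences $\mathrm{Var}(Z) = H_\Phi(Z)$ and $\mathrm{Var}^{(i)}(Z) = H_\Phi^{(i)}(Z)$, where $\Phi(u) = u^2$, this theorem is a direct consequence of the subadditivity of matrix $\Phi$-entropies; namely, Theorem 3.2:

\[ \mathrm{Var}(Z) \le \sum_{i=1}^{n} \mathbb{E}\, \mathrm{Var}^{(i)}(Z) = \sum_{i=1}^{n} \mathrm{tr}\, \mathbb{E}\big[ (Z - Z_i')_+^2 \big] = \mathcal{E}(Z). \]

The last line follows from (4.1). □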
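The matrix Efron-Stein inequality can be verified by exact enumeration on a small product space. In the sketch below, $Z = f(X_1, X_2)$ is a hypothetical table of random Hermitian matrices, $\mathrm{tr}$ is taken to be the normalised trace, and the right-hand side is computed in the form $\frac{1}{2}\sum_i \mathrm{tr}\,\mathbb{E}[(Z - Z_i')^2]$; all names and probabilities are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
d, k1, k2 = 3, 3, 2
p1 = np.array([0.2, 0.5, 0.3])  # law of X_1 (hypothetical)
p2 = np.array([0.6, 0.4])       # law of X_2 (hypothetical)

def random_hermitian(dim):
    g = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    return (g + g.conj().T) / 2

# z[a][b] is the value of Z when X_1 = a and X_2 = b.
z = [[random_hermitian(d) for _ in range(k2)] for _ in range(k1)]

def ntr_square(m):
    """Normalised trace of m^2 for a Hermitian matrix m."""
    return np.trace(m @ m).real / m.shape[0]

mean = sum(p1[a] * p2[b] * z[a][b] for a in range(k1) for b in range(k2))
variance = sum(p1[a] * p2[b] * ntr_square(z[a][b] - mean)
               for a in range(k1) for b in range(k2))

# Resample X_1 (index c is the independent copy), keeping X_2 = b fixed.
term_1 = 0.5 * sum(p1[a] * p1[c] * p2[b] * ntr_square(z[a][b] - z[c][b])
                   for a in range(k1) for c in range(k1) for b in range(k2))
# Resample X_2 (index c is the independent copy), keeping X_1 = a fixed.
term_2 = 0.5 * sum(p1[a] * p2[b] * p2[c] * ntr_square(z[a][b] - z[a][c])
                   for a in range(k1) for b in range(k2) for c in range(k2))

efron_stein_bound = term_1 + term_2
print(variance, efron_stein_bound)
```

The printed variance should not exceed the bound, as Theorem 4.1 asserts.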

Remark 4.1. Theorem 4.1 can be expressed in terms of the Schatten 2-norm (see Eq. (4.2)):

\[ \mathbb{E}\, \|Z - \mathbb{E} Z\|_2^2 \le \frac{1}{2}\, \mathbb{E} \sum_{i=1}^{n} \|Z - Z_i'\|_2^2 . \tag{4.4} \]

We remark that Paulin, Mackey, and Tropp [28, Theorem 4.2] have derived a generalised Efron-Stein 
inequality 

E||Z-EZ||g < 2{2p - 1) • E 

which reduces to Eq. (4.4) (as $p = 1$) with a worse constant.

The matrix Efron-Stein inequality can be used to prove a matrix version of the Poincaré inequality.

Theorem 4.2 (Matrix Poincaré Inequality). Let $\mathbf{X} = (X_1, \ldots, X_n) \in (\mathbb{M}_d^{sa})^n$ be an $n$-tuple of independent random matrices taking values in the interval $[0, I]$ (under the Löwner partial ordering) and let $\mathcal{L} : (\mathbb{M}_d^{sa}([0, 1]))^n \to \mathbb{M}_d^{sa}$ be a separately convex function $^3$ with finite partial Fréchet derivatives. Then $\mathcal{L}(\mathbf{X}) = \mathcal{L}(X_1, \ldots, X_n)$ satisfies

\[ \mathrm{Var}(\mathcal{L}(\mathbf{X})) \le \sum_{i=1}^{n} \mathbb{E}\big[ \|\mathsf{D}_{X_i} \mathcal{L}[\mathbf{X}]\|_2^2 \big], \]

where $\|\mathsf{D}_{X_i} \mathcal{L}[\mathbf{X}]\|_2$ is the norm of the Fréchet derivative defined in Eq. (2.4).

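Proof. Recall $Z = \mathcal{L}(\mathbf{X})$ and $Z_i' = \mathcal{L}(\mathbf{X}^{(i)}) = \mathcal{L}(X_1, \ldots, X_{i-1}, X_i', X_{i+1}, \ldots, X_n)$, where $X_i'$ is an identical copy of $X_i$. The proof follows from the matrix Efron-Stein inequality, Theorem 4.1:

\[ \mathrm{Var}(\mathcal{L}(\mathbf{X})) = \mathrm{Var}(Z) \le \sum_{i=1}^{n} \mathrm{tr}\, \mathbb{E}\big[ (Z - Z_i')_+^2 \big]. \]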


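$^3$ Note that $\mathcal{L}$ here is a multivariate super-operator. The separate convexity means that, for $0 \le t \le 1$,
\[ t \mathcal{L}(\mathbf{Y}) + (1 - t) \mathcal{L}(\mathbf{Y}^{(i)}) \succeq \mathcal{L}\big( t\mathbf{Y} + (1 - t) \mathbf{Y}^{(i)} \big) \]
for $\mathbf{Y} = (Y_1, \ldots, Y_n) \in (\mathbb{M}_d^{sa})^n$ and $\mathbf{Y}^{(i)} = (Y_1, \ldots, Y_{i-1}, Y_i', Y_{i+1}, \ldots, Y_n) \in (\mathbb{M}_d^{sa})^n$.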
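Then it suffices to bound each term $\mathrm{tr}\, \mathbb{E}\,(Z - Z_i')_+^2$ of the right-hand side above:

\[ \begin{aligned} \mathrm{tr}\,(Z - Z_i')_+^2 &= \mathrm{tr}\big( \mathcal{L}(\mathbf{X}) - \mathcal{L}(\mathbf{X}^{(i)}) \big)_+^2 \\ &\le \mathrm{tr}\big( \mathsf{D}_{X_i} \mathcal{L}[\mathbf{X}](X_i - X_i') \big)_+^2 \\ &\le \mathrm{tr}\big| \mathsf{D}_{X_i} \mathcal{L}[\mathbf{X}](X_i - X_i') \big|^2 \\ &= \|\mathsf{D}_{X_i} \mathcal{L}[\mathbf{X}](X_i - X_i')\|_2^2 / d \\ &\le \|\mathsf{D}_{X_i} \mathcal{L}[\mathbf{X}]\|_2^2 \cdot \|X_i - X_i'\|_2^2 / d \\ &\le \|\mathsf{D}_{X_i} \mathcal{L}[\mathbf{X}]\|_2^2 \cdot \|I\|_2^2 / d \\ &= \|\mathsf{D}_{X_i} \mathcal{L}[\mathbf{X}]\|_2^2 . \end{aligned} \]

The second line comes from the separate operator convexity of $\mathcal{L}$:

\[ \mathcal{L}(\mathbf{X}) - \mathcal{L}(\mathbf{X}^{(i)}) \preceq \mathsf{D}_{X_i} \mathcal{L}[\mathbf{X}](X_i - X_i'), \tag{4.5} \]

and the monotonicity of the trace function composed with the monotone function $(\,\cdot\,)_+^2$ (see Proposition 2.4). The next line follows from the relation

\[ A_+ \preceq |A|, \quad \forall A \in \mathbb{M}^{sa}. \]

The fourth line follows from the definition of the Schatten 2-norm (2.1) (i.e. the Frobenius norm):

\[ \|\cdot\|_2 = \big( \mathrm{Tr}\, |\cdot|^2 \big)^{1/2}. \]

The fifth line follows directly from the norm of Fréchet derivatives, i.e.

\[ \|\mathsf{D} f[A](B)\|_2 \le \|\mathsf{D} f[A]\|_2 \cdot \|B\|_2, \quad \forall A, B \in \mathbb{M}^{sa}. \]

Finally, we use the assumption $0 \preceq X_i, X_i' \preceq I$ and $\|I\|_2 = \sqrt{d}$ to complete the proof. □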


Note that Theorem 4.2 generalises the classical Poincaré inequality (e.g. [5, Theorem 3.17]):

\[ \mathrm{Var}(f(X)) \le \mathbb{E}\big[ \|\nabla f(X)\|^2 \big], \]

where $X = (X_1, \ldots, X_n)$ denotes an independent random vector; each element takes values in the interval $[0, 1]$.


Remark 4.2. We remark that although the matrix Poincaré inequality and its classical counterpart have similar expressions, their proof techniques are different. The proof of the classical Poincaré inequality relies on an infimum representation $Z_i = \inf_{x_i'} f(X_1, \ldots, x_i', \ldots, X_n)$ in the Efron-Stein inequality in Eq. (??) (see e.g. [24, 25]). The reason for choosing such a $Z_i$ is to guarantee the (almost sure) non-negativity of $Z - Z_i$ so that the square function preserves the ordering of the inequality, i.e.

\[ 0 \le f(X) - f(X^{(i)}) \le \frac{\partial f}{\partial x_i}(X) \cdot (X_i - X_i') \ \Rightarrow\ \big( f(X) - f(X^{(i)}) \big)^2 \le \Big( \frac{\partial f}{\partial x_i}(X) \cdot (X_i - X_i') \Big)^2. \]

However, the matrix version of the Poincaré inequality is more intricate; namely, the infimum may not exist in the range of a matrix-valued function $\mathcal{L}(\mathbf{X})$. Such a difficulty can be circumvented by considering $(\mathcal{L}(\mathbf{X}) - \mathcal{L}(\mathbf{X}^{(i)}))_+$, so that the separate operator convexity (4.5) can still be applied.


Note that Theorem 4.2 considers the matrix Poincaré inequality for general matrix functions $\mathcal{L} : (\mathbb{M}_d^{sa})^n \to \mathbb{M}_d^{sa}$. In Corollary 4.1 below, we impose an additional pairwise commutativity condition on $\mathbf{X} = (X_1, \ldots, X_n)$, namely $[X_i, X_j] = 0$ almost surely for $i \ne j \in [n]$, and obtain the following matrix Poincaré inequality for multivariate standard matrix functions (Definition 2.2).

Corollary 4.1 (Matrix Poincaré Inequality for Multivariate Standard Matrix Functions). Let $\mathbf{X} = (X_1, \ldots, X_n)$ be an $n$-tuple of independent random matrices taking values in $\mathbb{M}_d^{sa}$ with joint spectrum in $[0, 1]^n$, i.e. $[X_i, X_j] = 0$ almost surely for $i \ne j \in [n]$. Let $f : ([0, 1])^n \to \mathbb{R}$ be a multivariate standard matrix function that is separately operator convex and has finite partial Fréchet derivatives. Then, $f(\mathbf{X}) = f(X_1, \ldots, X_n)$ satisfies

\[ \mathrm{Var}(f(\mathbf{X})) \le \sum_{i=1}^{n} \mathbb{E}\, \big\| [\varphi_i(\Lambda_k, \Lambda_l)]_{kl} \big\|_{\sup}^2, \]

where $\varphi_i$ is the divided difference of $f$ defined in Eq. (2.4), and $\Lambda_k = (\lambda_{1k}, \ldots, \lambda_{ik}, \ldots, \lambda_{nk})$ with $\lambda_{ik}$ being the $k$-th eigenvalue of $X_i$.

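Proof. Following the same argument as in the proof of Theorem 4.2, we have

\[ \begin{aligned} \mathrm{tr}\,(Z - Z_i')_+^2 &\le \|\mathsf{D}_{X_i} f[\mathbf{X}](X_i - X_i')\|_2^2 / d \\ &= \big\| [\varphi_i(\Lambda_k, \Lambda_l)]_{kl} \odot (X_i - X_i') \big\|_2^2 / d \\ &\le \big\| [\varphi_i(\Lambda_k, \Lambda_l)]_{kl} \big\|_{\sup}^2 \cdot \|X_i - X_i'\|_2^2 / d \\ &\le \big\| [\varphi_i(\Lambda_k, \Lambda_l)]_{kl} \big\|_{\sup}^2 \cdot \|I\|_2^2 / d \\ &= \big\| [\varphi_i(\Lambda_k, \Lambda_l)]_{kl} \big\|_{\sup}^2, \end{aligned} \]

where the second line follows from the multivariate version of the Daleckii and Krein formula (2.7). In the third line we apply the norm inequality for the Schur product [63], i.e.

\[ \|A \odot B\|_2 \le \|A\|_{\sup} \cdot \|B\|_2, \quad \forall A, B \in \mathbb{M}. \]

□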


Remark 4.3. It is worth mentioning that Corollary 4.1 also reduces to the classical Poincaré inequality for real-valued functions ($d = 1$) and for vector-valued functions (corresponding to $(X_i)_{i=1}^n$ being $d \times d$ diagonal random matrices). Hence, Theorem 4.2 is essentially a "non-commutative" generalisation of the classical Poincaré inequality while Corollary 4.1 is a "commutative" generalisation.


If the multivariate standard matrix function $f$ is a Lipschitz function with the Lipschitz constant

\[ \|f\|_\Lambda = \sup_{x \ne y} \frac{|f(x) - f(y)|}{\|x - y\|_2}, \]

where $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$ (and likewise $y$), then the matrix Poincaré inequality (Corollary 4.1) has an even more appealing form.

Corollary 4.2 (Matrix Poincaré Inequality for Lipschitz Functions). Let $\mathbf{X} = (X_1, \ldots, X_n)$ be an $n$-tuple of independent random matrices taking values in $\mathbb{M}_d^{sa}$ with joint spectrum in $[0, 1]^n$; i.e. $[X_i, X_j] = 0$ almost surely for $i \ne j \in [n]$. Let $f : ([0, 1])^n \to \mathbb{R}$ be a multivariate standard matrix function with Lipschitz constant $\|f\|_\Lambda$. Then for all such $n$-tuples of random matrices $\mathbf{X}$, there exists a universal constant $k(n)$ such that

\[ \mathrm{Var}(f(\mathbf{X})) \le k(n)\, \|f\|_\Lambda^2 . \]

Proof. The main ingredient in establishing this corollary is the following bound for Lipschitz functions.

Proposition 4.1 ([64]). Let $\mathbf{A} = (A_i)_{i=1}^n$ and $\mathbf{B} = (B_i)_{i=1}^n$ be families of commuting self-adjoint operators. If $f : \mathbb{R}^n \to \mathbb{C}$ is a Lipschitz function, then there exists a constant $k'(n)$ such that

\[ \|f(\mathbf{A}) - f(\mathbf{B})\|_p \le k'(n)\, \|f\|_\Lambda \sum_{i=1}^{n} \|A_i - B_i\|_p . \]
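It follows that

\[ \begin{aligned} \mathrm{tr}\Big[ \big( f(\mathbf{X}) - f(\mathbf{X}^{(i)}) \big)^2 \Big] &= \frac{1}{d} \big\| f(\mathbf{X}) - f(\mathbf{X}^{(i)}) \big\|_2^2 \\ &\le \frac{1}{d}\, k'(n)^2\, \|f\|_\Lambda^2 \cdot \|X_i - X_i'\|_2^2 \\ &\le \frac{1}{d}\, k'(n)^2\, \|f\|_\Lambda^2 \cdot \|I\|_2^2 \\ &= k'(n)^2\, \|f\|_\Lambda^2, \end{aligned} \]

and therefore

\[ \begin{aligned} \mathrm{Var}(f(\mathbf{X})) &\le \frac{1}{2} \sum_{i=1}^{n} \mathbb{E}\, \mathrm{tr}\Big[ \big( f(\mathbf{X}) - f(\mathbf{X}^{(i)}) \big)^2 \Big] \\ &\le \frac{k'(n)^2}{2} \sum_{i=1}^{n} \|f\|_\Lambda^2 \\ &= k(n)\, \|f\|_\Lambda^2, \end{aligned} \]

where, in the last line, $k(n) = n \cdot \frac{k'(n)^2}{2}$. □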


The significance of the subadditivity property can be seen in Corollary 4.2 when the function is a Lipschitz function. The resulting Poincaré inequality depends only on the Lipschitz constant and a universal constant $k(n)$.

Remark 4.4. A multivariate function $f$ is called operator Lipschitz if there exists a universal constant $C$ such that

\[ \|f(\mathbf{A}) - f(\mathbf{B})\| \le C \cdot \max_{1 \le i \le n} \|A_i - B_i\| \]

for all $n$-tuples of commuting self-adjoint operators $(A_i)_{i=1}^n$ and $(B_i)_{i=1}^n$. It was proved that not every Lipschitz function $f$ is operator Lipschitz [65].

The matrix Efron-Stein inequality is used in Theorem 4.2 to prove the matrix Poincaré inequality. Next we show that the matrix Efron-Stein inequality can also be applied to establish an upper bound, known as the Gaussian Poincaré inequality, for a Fréchet differentiable matrix-valued function of Gaussian Unitary Ensembles (GUE) $^4$.

Theorem 4.3 (Matrix Gaussian Poincaré Inequality). Let $X_1, \ldots, X_n$ be independent random matrices $^5$ from the Gaussian Unitary Ensemble and let $\mathcal{L} : (\mathbb{M}_d^{sa})^n \to \mathbb{M}_d^{sa}$ be any twice Fréchet differentiable function. Then $\mathcal{L}(\mathbf{X})$ satisfies

\[ \mathrm{Var}(\mathcal{L}(\mathbf{X})) \le \sum_{i=1}^{n} \mathbb{E}\big[ \|\mathsf{D}_{X_i} \mathcal{L}[\mathbf{X}]\|_2^2 \big]. \]
Proof. We borrow the idea from [67] (see also [5, Theorem 3.20]) to prove this theorem. First, we assume $\sum_{i=1}^{n} \mathbb{E}\big[ \|\mathsf{D}_{X_i} \mathcal{L}[\mathbf{X}]\|_2^2 \big] < \infty$; otherwise the inequality holds trivially. Second, it suffices to establish the theorem for $n = 1$:

\[ \mathrm{Var}(\mathcal{L}(X)) \le \mathbb{E}\, \|\mathsf{D}\mathcal{L}[X]\|_2^2 . \tag{4.6} \]

It can be easily extended to every $n \in \mathbb{N}$ by applying the matrix Efron-Stein inequality, Theorem 4.1.

$^4$ The Gaussian Unitary Ensembles are a family of random Hermitian matrices whose upper-triangular entries are independently and identically distributed (i.i.d.) complex standard Gaussian random variables, while the diagonal entries are i.i.d. real standard Gaussian random variables; see e.g. [66, §2.6].

$^5$ We consider "entry-wise" independence here.
Now for every $j \in [m] = \{1, \ldots, m\}$, denote by $W_j, W_j'$ the $d \times d$ matrices whose entries are independent Rademacher random variables (i.e. uniformly $\{\pm 1\}$-valued random variables) and let


Yj = 


Wi + i-W'A + iWj+i- W' 


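Let $\epsilon_1, \ldots, \epsilon_m$ be a sequence of independent Rademacher random variables, and define

\[ S_m = \frac{1}{\sqrt{m}} \sum_{j=1}^{m} \epsilon_j Y_j . \]

Then, for every $j \in [m]$,

\[ \mathrm{Var}^{(j)}(\mathcal{L}(S_m)) = \frac{1}{4}\, \mathrm{tr}\Big[ \Big( \mathcal{L}\Big( S_m + \frac{1 - \epsilon_j}{\sqrt{m}} Y_j \Big) - \mathcal{L}\Big( S_m - \frac{1 + \epsilon_j}{\sqrt{m}} Y_j \Big) \Big)^2 \Big]. \]

Invoke the matrix Efron-Stein inequality to obtain

\[ \mathrm{Var}(\mathcal{L}(S_m)) \le \frac{1}{4} \sum_{j=1}^{m} \mathrm{tr}\, \mathbb{E}\Big[ \Big( \mathcal{L}\Big( S_m + \frac{1 - \epsilon_j}{\sqrt{m}} Y_j \Big) - \mathcal{L}\Big( S_m - \frac{1 + \epsilon_j}{\sqrt{m}} Y_j \Big) \Big)^2 \Big]. \tag{4.7} \]

Now we use Taylor expansion to further bound the right-hand side of Eq. (4.7). For every $j \in [m]$, it follows almost surely that

\[ \begin{aligned} \mathcal{L}\Big( S_m + \frac{1 - \epsilon_j}{\sqrt{m}} Y_j \Big) &= \mathcal{L}(S_m) + \mathsf{D}\mathcal{L}[S_m]\Big( \frac{1 - \epsilon_j}{\sqrt{m}} Y_j \Big) + \mathcal{R}_2\Big( S_m, \frac{1 - \epsilon_j}{\sqrt{m}} Y_j \Big), \\ \mathcal{L}\Big( S_m - \frac{1 + \epsilon_j}{\sqrt{m}} Y_j \Big) &= \mathcal{L}(S_m) + \mathsf{D}\mathcal{L}[S_m]\Big( -\frac{1 + \epsilon_j}{\sqrt{m}} Y_j \Big) + \mathcal{R}_2\Big( S_m, -\frac{1 + \epsilon_j}{\sqrt{m}} Y_j \Big), \end{aligned} \]

where $\mathcal{R}_l$ is the remainder term of the Taylor series:

\[ \mathcal{R}_l(X, E) = \sum_{k=l}^{\infty} \frac{1}{k!} \mathsf{D}^k \mathcal{L}[X](E, \ldots, E) = O\big( \|E\|^l \big). \]

Therefore,

\[ \mathcal{L}\Big( S_m + \frac{1 - \epsilon_j}{\sqrt{m}} Y_j \Big) - \mathcal{L}\Big( S_m - \frac{1 + \epsilon_j}{\sqrt{m}} Y_j \Big) = \frac{2}{\sqrt{m}} \mathsf{D}\mathcal{L}[S_m](Y_j) + o\Big( \frac{1}{\sqrt{m}} \Big), \]

and

\[ \frac{1}{4} \sum_{j=1}^{m} \mathrm{tr}\, \mathbb{E}\Big[ \Big( \mathcal{L}\Big( S_m + \frac{1 - \epsilon_j}{\sqrt{m}} Y_j \Big) - \mathcal{L}\Big( S_m - \frac{1 + \epsilon_j}{\sqrt{m}} Y_j \Big) \Big)^2 \Big] \le \mathbb{E}\, \|\mathsf{D}\mathcal{L}[S_m]\|_2^2 + o(1). \]

Letting $m$ go to infinity, we have

\[ \lim_{m \to \infty} \frac{1}{4} \sum_{j=1}^{m} \mathrm{tr}\, \mathbb{E}\Big[ \Big( \mathcal{L}\Big( S_m + \frac{1 - \epsilon_j}{\sqrt{m}} Y_j \Big) - \mathcal{L}\Big( S_m - \frac{1 + \epsilon_j}{\sqrt{m}} Y_j \Big) \Big)^2 \Big] = \mathbb{E}\big[ \|\mathsf{D}\mathcal{L}[X]\|_2^2 \big], \tag{4.8} \]

where, by the central limit theorem (see Lemma A.3), $S_m$ converges in distribution to a random matrix $X$ in GUE. Thus $\mathrm{Var}(\mathcal{L}(S_m))$ converges to $\mathrm{Var}(\mathcal{L}(X))$.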
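Finally, the subadditivity of the variance and Eq. (4.6) lead to

\[ \begin{aligned} \mathrm{Var}(\mathcal{L}(\mathbf{X})) &\le \sum_{i=1}^{n} \mathbb{E}\big[ \mathrm{Var}^{(i)}(\mathcal{L}(\mathbf{X})) \big] \\ &\le \sum_{i=1}^{n} \mathbb{E}\Big[ \mathbb{E}_i\big[ \|\mathsf{D}_{X_i} \mathcal{L}[\mathbf{X}]\|_2^2 \big] \Big] \\ &= \sum_{i=1}^{n} \mathbb{E}\big[ \|\mathsf{D}_{X_i} \mathcal{L}[\mathbf{X}]\|_2^2 \big], \end{aligned} \]

which completes the proof. □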


4.2. Matrix $\Phi$-Sobolev Inequalities. For completeness, we provide a short review of the classical Bonami-Beckner inequality in Appendix B.

In this section, we consider matrix-valued functions defined on Boolean hypercubes, $f : \{0, 1\}^n \to \mathbb{M}_d^{sa}$, and establish matrix $\Phi$-Sobolev inequalities. The main ingredient in proving these inequalities comes from Fourier analysis and the hypercontractive inequality for matrix-valued functions.

Ben-Aroya et al. [18] generalised Bonami and Beckner's results by considering matrix-valued functions $f : \{0, 1\}^n \to \mathbb{M}_d$. Fourier analysis extends naturally to the matrix setting; that is, the Fourier transform $\hat{f}$ of the matrix-valued function $f$ can be expressed as

\[ \begin{aligned} \hat{f}(S) &= \frac{1}{2^n} \sum_{x \in \{0, 1\}^n} f(x)\, \chi_S(x); \\ f(x) &= \sum_{S \subseteq \{1, \ldots, n\}} \hat{f}(S)\, \chi_S(x). \end{aligned} \]

Therefore, the hypercontractive inequality (B.2) can be extended to matrix-valued functions.
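The matrix-valued Fourier transform on the hypercube is straightforward to compute for small $n$. The sketch below assumes the standard characters $\chi_S(x) = (-1)^{\sum_{i \in S} x_i}$ and a randomly generated symmetric-matrix-valued $f$ (both are illustrative assumptions); it checks the inversion formula and the matrix Parseval identity $\mathbb{E}[f(X)^2] = \sum_S \hat{f}(S)^2$.

```python
import itertools
import numpy as np

n, d = 3, 2
rng = np.random.default_rng(4)

points = list(itertools.product((0, 1), repeat=n))
subsets = [s for r in range(n + 1) for s in itertools.combinations(range(n), r)]

def chi(s, x):
    """Character chi_S(x) = (-1)^(sum of x_i over i in S) (assumed convention)."""
    return (-1) ** sum(x[i] for i in s)

def random_symmetric():
    g = rng.normal(size=(d, d))
    return (g + g.T) / 2

# A hypothetical matrix-valued function on the hypercube.
f = {x: random_symmetric() for x in points}

# Fourier transform and its inverse.
f_hat = {s: sum(f[x] * chi(s, x) for x in points) / 2 ** n for s in subsets}
reconstructed = {x: sum(f_hat[s] * chi(s, x) for s in subsets) for x in points}

# Matrix Parseval identity under the uniform distribution.
mean_square = sum(f[x] @ f[x] for x in points) / 2 ** n
parseval_sum = sum(f_hat[s] @ f_hat[s] for s in subsets)

print(np.abs(mean_square - parseval_sum).max())
```

The inversion and Parseval checks rely only on the orthogonality of the characters, so they hold for any matrix-valued $f$.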


Theorem 4.4 (Matrix Bonami-Beckner Inequality [18]). For every $f : \{0, 1\}^n \to \mathbb{M}_d$ and $1 \le p \le 2$,

\[ \Big( \sum_{S \subseteq [n]} (p - 1)^{|S|}\, \|\hat{f}(S)\|_{*p}^2 \Big)^{1/2} \le \Big( \frac{1}{2^n} \sum_{x \in \{0, 1\}^n} \|f(x)\|_{*p}^p \Big)^{1/p}, \]

where the normalised Schatten $p$-norm is defined as $\|M\|_{*p} = (\mathrm{tr}\, |M|^p)^{1/p}$ for $M \in \mathbb{M}_d$.

With Theorem 4.4, we can prove a matrix $\Phi$-Sobolev inequality for matrix-valued functions of symmetric Bernoulli random variables.


Theorem 4.5 (Matrix $\Phi$-Sobolev Inequalities for Symmetric Bernoulli Random Variables). Let $X$ be uniformly distributed over $\mathcal{X} = \{0, 1\}^n$ (an $n$-dimensional binary hypercube) and let $f : \mathcal{X} \to \mathbb{M}_d^{+}$ be an arbitrary matrix-valued function. Then for all $p \in (1, 2)$, and $\Phi(u) = u^{2/p}$,

\[ H_\Phi(f^p) \le (2 - p)\, \mathcal{E}(f) \cdot d^{1 - 2/p} + \mathrm{tr}\, \mathbb{E}[f^2] \cdot (1 - d^{1 - 2/p}). \tag{4.9} \]


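Proof. Starting from the left-hand side of Eq. (4.9), the definition of the matrix $\Phi$-entropy functional gives

\[ \begin{aligned} H_\Phi(f^p) &= \mathrm{tr}\, \mathbb{E}[f^2] - \mathrm{tr}\, (\mathbb{E} f^p)^{2/p} \\ &\le \mathrm{tr}\, \mathbb{E}[f^2] - (\mathrm{tr}\, \mathbb{E} f^p)^{2/p} \\ &= \mathrm{tr}\, \mathbb{E}[f^2] - (\mathrm{tr}\, \mathbb{E} |f|^p)^{2/p} \quad (\text{since } f(X) \succeq 0) \\ &= \mathrm{tr}\, \mathbb{E}[f^2] - \big( \mathbb{E}\, \|f\|_{*p}^p \big)^{2/p}, \end{aligned} \tag{4.10} \]

where we apply the convexity lemma for the normalised trace function (Lemma 2.1) and recall that $(\,\cdot\,)^{2/p}$ is a convex function for $1 \le p \le 2$.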
Next, since the Schatten $p$-norm is a non-increasing function of $p$, we have the following relation for the normalised Schatten $p$-norm:

\[ \begin{aligned} \|M\|_{*p}^2 &= (\mathrm{tr}\, |M|^p)^{2/p} \\ &= (\mathrm{Tr}\, |M|^p)^{2/p} \cdot d^{-2/p} \\ &\ge (\mathrm{Tr}\, |M|^2)^{2/2} \cdot d^{-2/p} \\ &= (\mathrm{tr}\, |M|^2) \cdot d^{1 - 2/p} \\ &= \|M\|_{*2}^2 \cdot d^{1 - 2/p}, \end{aligned} \tag{4.11} \]

for every $M \in \mathbb{M}_d$ and $1 \le p \le 2$. Then, by combining Eq. (4.11) and Theorem 4.4, Eq. (4.10) can be rewritten as


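\[ \begin{aligned} H_\Phi(f^p) &\le \mathrm{tr}\, \mathbb{E}[f^2] - \big( \mathbb{E}\, \|f\|_{*p}^p \big)^{2/p} \\ &\le \mathrm{tr}\, \mathbb{E}[f^2] - \sum_{S \subseteq [n]} (p - 1)^{|S|}\, \|\hat{f}(S)\|_{*p}^2 \\ &\le \mathrm{tr}\, \mathbb{E}[f^2] - \Big( \sum_{S \subseteq [n]} (p - 1)^{|S|}\, \mathrm{tr}\big[ \hat{f}(S)^2 \big] \Big) \cdot d^{1 - 2/p} \\ &= \mathrm{tr}\Big[ \sum_{S \subseteq [n]} \hat{f}(S)^2 \Big] - \Big( \sum_{S \subseteq [n]} (p - 1)^{|S|}\, \mathrm{tr}\big[ \hat{f}(S)^2 \big] \Big) \cdot d^{1 - 2/p} \\ &= \mathrm{tr}\Big[ \sum_{S \subseteq [n]} \big( 1 - (p - 1)^{|S|} d^{1 - 2/p} \big) \hat{f}(S)^2 \Big], \end{aligned} \tag{4.12} \]

where we apply Parseval's identity (Lemma A.4) in the fourth line.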
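The norm comparison in Eq. (4.11), which supplies the factor $d^{1 - 2/p}$ in the chain above, can be checked numerically. The sketch assumes the normalised Schatten norm is computed from singular values as $(\frac{1}{d}\sum_i \sigma_i^p)^{1/p}$; the function name and the random test matrix are illustrative.

```python
import numpy as np

def normalized_schatten(m, p):
    """Normalised Schatten p-norm: ((1/d) * sum of singular values**p)**(1/p)."""
    singular_values = np.linalg.svd(m, compute_uv=False)
    return np.mean(singular_values ** p) ** (1.0 / p)

rng = np.random.default_rng(5)
d = 5
m = rng.normal(size=(d, d))  # hypothetical test matrix

exponents = (1.0, 1.25, 1.5, 1.75, 2.0)
# Eq. (4.11): ||M||_{*p}^2 >= ||M||_{*2}^2 * d^(1 - 2/p) for 1 <= p <= 2.
checks = [
    normalized_schatten(m, p) ** 2
    >= normalized_schatten(m, 2.0) ** 2 * d ** (1 - 2 / p) - 1e-12
    for p in exponents
]
print(checks)
```

At $p = 2$ the factor $d^{1 - 2/p}$ equals one and the comparison holds with equality.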
From elementary analysis, it holds for all $S \subseteq [n]$ and $1 \le p \le 2$ that

\[ 1 - (p - 1)^{|S|} \le (2 - p)\, |S| . \]

Therefore, it follows that

\[ 1 - (p - 1)^{|S|} d^{1 - 2/p} \le (2 - p)\, |S|\, d^{1 - 2/p} + (1 - d^{1 - 2/p}). \]
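Both elementary inequalities are instances of Bernoulli's inequality $(1 - x)^k \ge 1 - kx$ with $x = 2 - p$, followed by the decomposition $1 - ac = c(1 - a) + (1 - c)$ with $c = d^{1 - 2/p}$. A brute-force check over a grid is sketched below; the grid ranges and function names are arbitrary choices for illustration.

```python
def lhs(p, k, d):
    """1 - (p - 1)^k * d^(1 - 2/p), with k playing the role of |S|."""
    return 1 - (p - 1) ** k * d ** (1 - 2 / p)

def rhs(p, k, d):
    """(2 - p) * k * d^(1 - 2/p) + (1 - d^(1 - 2/p))."""
    scale = d ** (1 - 2 / p)
    return (2 - p) * k * scale + (1 - scale)

p_grid = [1 + j / 50 for j in range(51)]  # p in [1, 2]
all_hold = all(
    lhs(p, k, d) <= rhs(p, k, d) + 1e-12
    for p in p_grid
    for k in range(0, 11)
    for d in range(1, 7)
)
print(all_hold)
```

Equality occurs for $|S| = 0$ and at $p = 2$, so a small tolerance is used in the comparison.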

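Finally, by using the fact that $\sum_{S \subseteq [n]} |S|\, \mathrm{tr}\big[ \hat{f}(S)^2 \big] = \mathcal{E}(f)$ (see Lemma A.5), Eq. (4.12) can be further bounded as

\[ \begin{aligned} H_\Phi(f^p) &\le \mathrm{tr}\Big[ \sum_{S \subseteq [n]} \big( 1 - (p - 1)^{|S|} d^{1 - 2/p} \big) \hat{f}(S)^2 \Big] \\ &\le \mathrm{tr}\Big[ \sum_{S \subseteq [n]} \big( (2 - p)\, |S|\, d^{1 - 2/p} + (1 - d^{1 - 2/p}) \big) \hat{f}(S)^2 \Big] \\ &= (2 - p)\, \mathcal{E}(f) \cdot d^{1 - 2/p} + \mathrm{tr}\Big[ \sum_{S \subseteq [n]} \hat{f}(S)^2 \Big] \cdot (1 - d^{1 - 2/p}) \\ &= (2 - p)\, \mathcal{E}(f) \cdot d^{1 - 2/p} + \mathrm{tr}\, \mathbb{E}[f^2] \cdot (1 - d^{1 - 2/p}), \end{aligned} \]

which completes our claim. □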
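Theorem 4.6 (Matrix $\Phi$-Sobolev Inequalities for Gaussian Distributions). Let $\mathbf{X} = (X_1, \ldots, X_n)$ be a vector of $n$ independent standard Gaussian random variables taking values in $\mathcal{X} = \mathbb{R}^n$, and let $f : \mathcal{X} \to \mathbb{M}_d^{+}$ be an arbitrary matrix-valued function. Then for all $p \in (1, 2)$, and $\Phi(u) = u^{2/p}$,

\[ H_\Phi(f^p) \le (2 - p) \sum_{i=1}^{n} \mathbb{E}\big[ \|\mathsf{D}_{X_i} f[\mathbf{X}]\|_2^2 \big] \cdot d^{1 - 2/p} + \mathrm{tr}\, \mathbb{E}[f^2] \cdot (1 - d^{1 - 2/p}). \tag{4.13} \]

Proof. The proof parallels the approach in Theorem 4.3. Denote

\[ \mathcal{E}^{(i)}(f) = \frac{1}{2}\, \mathrm{tr}\, \mathbb{E}_i\Big[ \big( f(\mathbf{X}) - f(\mathbf{X}^{(i)}) \big)^2 \Big] \]

and

\[ \mathcal{E}(f) = \sum_{i=1}^{n} \mathbb{E}\, \mathcal{E}^{(i)}(f). \]

Recall Eq. (4.8) and let $Y_j = 1$:

\[ \mathcal{E}^{(i)}(f) = \lim_{m \to \infty} \frac{1}{4} \sum_{j=1}^{m} \mathrm{tr}\, \mathbb{E}_i\Big[ \Big( f\Big( S_m + \frac{1 - \epsilon_j}{\sqrt{m}} \Big) - f\Big( S_m - \frac{1 + \epsilon_j}{\sqrt{m}} \Big) \Big)^2 \Big] = \mathbb{E}_i\big[ \|\mathsf{D}_{X_i} f[\mathbf{X}]\|_2^2 \big]. \]

This and Theorem 4.5 yield Eq. (4.13), and the statement follows. □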

The logarithmic Sobolev inequality for matrix-valued functions immediately follows from Theorems 4.5 and 4.6.

Corollary 4.3 (Matrix Logarithmic Sobolev Inequalities for Symmetric Bernoulli Random Variables). Let $f : \{0, 1\}^n \to \mathbb{M}_d^{+}$ be an arbitrary matrix-valued function defined on the $n$-dimensional binary hypercube and assume that $X$ is uniformly distributed over $\{0, 1\}^n$. Then

\[ \mathrm{Ent}(f^2) \le 2\, \mathcal{E}(f) + \log(d) \cdot \mathrm{tr}\, \mathbb{E}[f^2] . \]

Proof. By letting $p \to 2$, the left-hand side of Eq. (4.9) becomes

\[ \lim_{p \to 2^-} \frac{H_\Phi(f^p)}{2 - p} = \lim_{p \to 2^-} \frac{\mathrm{tr}\big[ \mathbb{E}[f(X)^2] - (\mathbb{E}[f(X)^p])^{2/p} \big]}{2 - p} = \frac{\mathrm{Ent}(f^2)}{2}, \]

where the last identity follows from Lemma A.6. Similarly, the right-hand side gives

\[ \lim_{p \to 2^-} \frac{(2 - p)\, \mathcal{E}(f) \cdot d^{1 - 2/p} + \mathrm{tr}\, \mathbb{E}[f^2] \cdot (1 - d^{1 - 2/p})}{2 - p} = \mathcal{E}(f) + \frac{\log(d)}{2}\, \mathrm{tr}\, \mathbb{E}[f^2], \]

as established. □
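The dimension-dependent term in the limit comes from $\lim_{p \to 2^-} (1 - d^{1 - 2/p})/(2 - p) = \log(d)/2$, which follows from $1 - 2/p = -(2 - p)/p$. A quick numerical check is sketched below; the function name and the sample dimensions are illustrative assumptions.

```python
import math

def defect_ratio(d, p):
    """(1 - d^(1 - 2/p)) / (2 - p), the coefficient of tr E[f^2] before taking the limit."""
    return (1 - d ** (1 - 2 / p)) / (2 - p)

dimensions = (2, 4, 16)
p_close_to_two = 2 - 1e-6

ratios = {d: defect_ratio(d, p_close_to_two) for d in dimensions}
limits = {d: math.log(d) / 2 for d in dimensions}
print(ratios, limits)
```

For $d = 1$ the ratio vanishes identically, which recovers the tight classical case $D = 0$.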


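Corollary 4.4 (Matrix Gaussian Logarithmic Sobolev Inequalities). Assume that $\mathbf{X}$ is a vector of independent and identically distributed standard Gaussian random variables on $\mathbb{R}^n$ and let $f : \mathbb{R}^n \to \mathbb{M}_d^{+}$ be an arbitrary matrix-valued function of $\mathbf{X}$. Then

\[ \mathrm{Ent}(f^2) \le 2 \sum_{i=1}^{n} \mathbb{E}\big[ \|\mathsf{D}_{X_i} f[\mathbf{X}]\|_2^2 \big] + \log(d) \cdot \mathrm{tr}\, \mathbb{E}[f^2] . \]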


Remark 4.5. Denote by $\mathrm{LS}(C, D)$ (see e.g. [68, Section 5.1]) the set of logarithmic Sobolev inequalities with constants $C > 0$, $D \ge 0$:

$$\operatorname{Ent}(f^2) \le 2C\,\mathcal{E}(f) + D\,\mathbb{E}[f^2].$$

When $D = 0$, the logarithmic Sobolev inequality is called tight; otherwise, it is called defective. It is well known that the best constants of the classical logarithmic Sobolev inequalities for symmetric Bernoulli random variables and standard Gaussian random variables are $(C, D) = (1, 0)$ [15, 69]. However, numerical simulation shows that, for matrix-valued functions with $d > 1$, there exist examples with $\operatorname{Ent}(f^2) > 2\,\mathcal{E}(f)$. In Corollary 4.3, we establish the logarithmic Sobolev inequality with constants $(C, D) = (1, \log d)$. ◇
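As a numerical sanity check of Corollary 4.3, the sketch below evaluates both sides of the inequality for a randomly drawn Hermitian-valued function on $\{0,1\}^n$. It assumes that tr denotes the normalized trace $\mathrm{Tr}/d$, that $\operatorname{Ent}(Z) = \operatorname{tr}\mathbb{E}[Z \log Z] - \operatorname{tr}[\mathbb{E}Z \log \mathbb{E}Z]$, and that $\mathcal{E}(f) = \frac14 \operatorname{tr}\mathbb{E}\sum_i (f(X) - f(X^{(i)}))^2$ with $X^{(i)}$ the $i$-th bit flipped; all function names are illustrative and not taken from the paper.

```python
import itertools
import numpy as np

def xlogx(m):
    """Apply u -> u*log(u), with 0*log(0) = 0, to a PSD Hermitian matrix."""
    w, v = np.linalg.eigh(m)
    w = np.clip(w, 0.0, None)
    vals = np.where(w > 1e-15, w * np.log(np.maximum(w, 1e-300)), 0.0)
    return (v * vals) @ v.conj().T

def ntr(m):
    """Normalized trace Tr/d (an assumption about the paper's 'tr')."""
    return np.trace(m).real / m.shape[0]

def log_sobolev_sides(f, n):
    """Return (Ent(f^2), 2*E(f), log(d)*tr E[f^2]) for f given as a dict on {0,1}^n."""
    pts = list(itertools.product((0, 1), repeat=n))
    d = f[pts[0]].shape[0]
    sq = {x: f[x] @ f[x] for x in pts}
    mean_sq = sum(sq.values()) / len(pts)
    ent = sum(ntr(xlogx(sq[x])) for x in pts) / len(pts) - ntr(xlogx(mean_sq))
    energy = 0.0
    for x in pts:
        for i in range(n):
            flipped = x[:i] + (1 - x[i],) + x[i + 1:]
            diff = f[x] - f[flipped]
            energy += ntr(diff @ diff)
    energy /= 4 * len(pts)
    return ent, 2 * energy, np.log(d) * ntr(mean_sq)

def random_hermitian_function(n, d, rng):
    """Draw an arbitrary Hermitian-valued test function on {0,1}^n."""
    out = {}
    for x in itertools.product((0, 1), repeat=n):
        a = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
        out[x] = (a + a.conj().T) / 2
    return out

rng = np.random.default_rng(0)
f = random_hermitian_function(n=3, d=2, rng=rng)
ent, two_energy, defect = log_sobolev_sides(f, n=3)
print(ent, two_energy, defect)
```

Because every term is linear in the trace, the comparison does not depend on whether the trace is normalized.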
5. Entropic Inequality for Classical-Quantum Ensembles


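In this section, we connect the matrix $\Phi$-entropies with quantum information theory and present a functional inequality for classical-quantum (c-q) ensembles that undergo a special Markov evolution.

Let $\mathcal{X}$ be a sample space. We denote by $\mathcal{P}(\mathcal{X})$ the set of all probability distributions on $\mathcal{X}$ and by $\mathcal{P}_*(\mathcal{X})$ the subset of $\mathcal{P}(\mathcal{X})$ which consists of all strictly positive distributions. The set of all $d \times d$ matrix-valued functions on $\mathcal{X}$ is denoted by $\mathcal{F}(\mathcal{X})$; $\mathcal{F}_*(\mathcal{X})$ and $\mathcal{F}_+(\mathcal{X})$ are the subsets of $\mathcal{F}(\mathcal{X})$ consisting of all strictly positive and non-negative functions, respectively.

Any classical discrete channel or Markov kernel with input alphabet $\mathcal{X}$ and output alphabet $\mathcal{Y}$ can be described by transition probabilities $\{K(y|x) : x \in \mathcal{X}, y \in \mathcal{Y}\}$. For any probability distribution $\mu$ defined on the alphabet $\mathcal{X}$, we denote the channel acting on $\mu$ from the right by

$$\mu K(y) = \sum_{x \in \mathcal{X}} \mu(x) K(y|x), \quad y \in \mathcal{Y}, \tag{5.1}$$

or acting on matrix-valued functions $f \in \mathcal{F}(\mathcal{Y})$ by

$$K f(x) \triangleq \sum_{y \in \mathcal{Y}} K(y|x) f(y), \quad x \in \mathcal{X}. \tag{5.2}$$

The set of all classical channels is denoted by $\mathcal{M}(\mathcal{Y}|\mathcal{X})$. If $\mu \otimes K \in \mathcal{P}(\mathcal{X}) \times \mathcal{M}(\mathcal{Y}|\mathcal{X})$ denotes the distribution of a random pair $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ with $P_X = \mu$ and $P_{Y|X} = K$, then

$$K f(x) = \mathbb{E}[f(Y) \mid X = x] \tag{5.3}$$

for any $f \in \mathcal{F}(\mathcal{Y})$ and $x \in \mathcal{X}$. We say that a pair $(\mu, K) \in \mathcal{P}(\mathcal{X}) \times \mathcal{M}(\mathcal{Y}|\mathcal{X})$ is admissible if $\mu \in \mathcal{P}_*(\mathcal{X})$ and $\mu K \in \mathcal{P}_*(\mathcal{Y})$. Hence the backward or adjoint channel $K^* \in \mathcal{M}(\mathcal{X}|\mathcal{Y})$ can be defined by

$$K^*(x|y) = \frac{K(y|x)\,\mu(x)}{\mu K(y)}, \quad (x, y) \in \mathcal{X} \times \mathcal{Y}. \tag{5.4}$$

If $(X, Y) \sim \mu \otimes K$, then $K^* = P_{X|Y}$ and

$$K^* f(y) = \mathbb{E}[f(X) \mid Y = y] \tag{5.5}$$

for any $f \in \mathcal{F}(\mathcal{X})$ and $y \in \mathcal{Y}$.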
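The definitions (5.1), (5.2), and (5.4) can be made concrete for finite alphabets. The sketch below stores a channel as an array with `K[x, y]` standing for $K(y|x)$; the array layout, the example numbers, and the function names are assumptions made for illustration only.

```python
import numpy as np

def push_forward(mu, K):
    """(5.1): (mu K)(y) = sum_x mu(x) K(y|x), with K[x, y] = K(y|x)."""
    return mu @ K

def backward_channel(mu, K):
    """(5.4): K*(x|y) = K(y|x) mu(x) / (mu K)(y), returned as Kstar[y, x]."""
    muK = push_forward(mu, K)
    return (K * mu[:, None]).T / muK[:, None]

def apply_channel(kernel, f):
    """(5.2): (kernel f)(a) = sum_b kernel[a, b] f(b) for matrix-valued f of shape (|B|, d, d)."""
    return np.einsum("ab,bij->aij", kernel, f)

mu = np.array([0.2, 0.5, 0.3])          # strictly positive input distribution
K = np.array([[0.7, 0.3],
              [0.4, 0.6],
              [0.1, 0.9]])              # rows indexed by x, columns by y
muK = push_forward(mu, K)
Kstar = backward_channel(mu, K)

rng = np.random.default_rng(0)
f = rng.normal(size=(3, 2, 2))          # a matrix-valued function on the input alphabet
Kstar_f = apply_channel(Kstar, f)       # (5.5): conditional expectation E[f(X) | Y = y]
print(muK, Kstar.sum(axis=1))
```

Each row of `Kstar` is a probability vector, and averaging `Kstar_f` over `muK` recovers the mean of `f` under `mu`, which is the tower property behind (5.5).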

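Recall that the conditional matrix $\Phi$-entropy of $Z$ given $Y$, which takes values in any Polish space, can be defined by

$$H_\Phi(Z|Y) \triangleq \operatorname{tr} \mathbb{E}[\Phi(Z)|Y] - \operatorname{tr}\big[ \Phi(\mathbb{E}[Z|Y]) \big]. \tag{5.6}$$

Combining the definition of matrix $\Phi$-entropies with (5.6) immediately gives the following law of total variance:

$$\begin{aligned}
H_\Phi(Z) &= \operatorname{tr} \mathbb{E}[\Phi(Z)] - \operatorname{tr}[\Phi(\mathbb{E} Z)] \\
&= \operatorname{tr} \mathbb{E}_Y \mathbb{E}[\Phi(Z)|Y] - \operatorname{tr}\big[ \Phi(\mathbb{E}_Y \mathbb{E}[Z|Y]) \big] \\
&= \mathbb{E}_Y\big[ H_\Phi(Z|Y) + \operatorname{tr} \Phi(\mathbb{E}[Z|Y]) \big] - \operatorname{tr}\big[ \Phi(\mathbb{E}_Y \mathbb{E}[Z|Y]) \big] \\
&= \mathbb{E}_Y\big[ H_\Phi(Z|Y) \big] + H_\Phi\big( \mathbb{E}[Z|Y] \big).
\end{aligned} \tag{5.7}$$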
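The decomposition (5.7) can be checked numerically on a finite joint distribution. The sketch below takes $\Phi(u) = u \log u$, lets $Z$ be a positive semidefinite function of a discrete variable $X$, and assumes tr is the normalized trace; the helper names and the random test data are illustrative.

```python
import numpy as np

def phi(m):
    """Phi(u) = u log u applied to a PSD matrix through its eigenvalues."""
    w, v = np.linalg.eigh(m)
    w = np.clip(w, 0.0, None)
    vals = np.where(w > 1e-15, w * np.log(np.maximum(w, 1e-300)), 0.0)
    return (v * vals) @ v.conj().T

def ntr(m):
    """Normalized trace Tr/d (assumed meaning of 'tr')."""
    return np.trace(m).real / m.shape[0]

def entropy(weights, mats):
    """H_Phi(Z) = tr E[Phi(Z)] - tr Phi(E Z) for a finitely supported Z."""
    mean = np.einsum("k,kij->ij", weights, mats)
    return sum(w * ntr(phi(m)) for w, m in zip(weights, mats)) - ntr(phi(mean))

rng = np.random.default_rng(1)
joint = rng.random((4, 3))
joint /= joint.sum()                               # joint[x, y] = P(X = x, Y = y)
g = rng.normal(size=(4, 2, 2))
Z = np.einsum("kij,klj->kil", g, g)                # Z(x) = g(x) g(x)^T is PSD
pX, pY = joint.sum(axis=1), joint.sum(axis=0)
cond = joint / pY                                  # cond[x, y] = P(X = x | Y = y)

total = entropy(pX, Z)                                           # H_Phi(Z)
within = sum(pY[y] * entropy(cond[:, y], Z) for y in range(3))   # E_Y[H_Phi(Z|Y)]
cond_means = np.einsum("xy,xij->yij", cond, Z)                   # E[Z | Y = y]
between = entropy(pY, cond_means)                                # H_Phi(E[Z|Y])
print(total, within + between)
```

The two printed numbers agree up to floating-point error, and both summands are nonnegative.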


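Fix $\Phi(u) = u \log u$ and assume that the distribution $\mu \in \mathcal{P}(\mathcal{X})$ is defined on a discrete space $\mathcal{X}$. If we consider a random matrix $\rho_X$ to be an ensemble of classical-quantum (c-q) states $(\mu, \boldsymbol{\rho}) \triangleq \{(\mu(x), \rho_x)\}_{x \in \mathcal{X}}$, where each $\rho_x \succeq 0$ and $\operatorname{Tr} \rho_x = 1$, then its $\Phi$-entropy is related to the Holevo quantity of $\{(\mu(x), \rho_x)\}_{x \in \mathcal{X}}$:

$$\begin{aligned}
d \cdot H_\Phi(\rho_X) &= \sum_{x \in \mathcal{X}} \mu(x) \operatorname{Tr}[\rho_x \log \rho_x] - \operatorname{Tr}[\bar\rho \log \bar\rho] \\
&= \sum_{x \in \mathcal{X}} \mu(x) \cdot S(\rho_x \| \bar\rho) \\
&=: \chi(\mu, \boldsymbol{\rho}),
\end{aligned}$$

where $\bar\rho \triangleq \mathbb{E}_\mu[\rho_X] = \sum_{x \in \mathcal{X}} \mu(x) \rho_x$ and $S(\rho \| \sigma) \triangleq \operatorname{Tr} \rho (\log \rho - \log \sigma)$ is the quantum relative entropy.

Denote by $\rho_Y \triangleq \{\mu'(y), \rho'_y\}_{y \in \mathcal{Y}}$ the resulting random matrix of $\rho_X$ that undergoes a Markov evolution $K$ by the rule:

$$\{\mu(x)\}_{x \in \mathcal{X}} \mapsto \{\mu K(y)\}_{y \in \mathcal{Y}} = \Big\{ \sum_{x \in \mathcal{X}} \mu(x) K(y|x) \Big\}_{y \in \mathcal{Y}} =: \{\mu'(y)\}_{y \in \mathcal{Y}}; \tag{5.8}$$

$$\{\rho_x\}_{x \in \mathcal{X}} \mapsto \{K^* \boldsymbol{\rho}(y)\}_{y \in \mathcal{Y}} = \Big\{ \sum_{x \in \mathcal{X}} K^*(x|y) \rho_x \Big\}_{y \in \mathcal{Y}} =: \{\rho'_y\}_{y \in \mathcal{Y}}. \tag{5.9}$$

Note that each $\rho'_y$ can be interpreted as the conditional expectation $\mathbb{E}_{K^*}[\rho_X \mid Y = y]$, which is a post-selection state with the probability law $\{K^*(x|y)\}_{x \in \mathcal{X}}$. We also have the following relation between the $\Phi$-entropy of $\rho_Y$ and the Holevo quantity of $(\mu', \boldsymbol{\rho}') \triangleq \{(\mu'(y), \rho'_y)\}_{y \in \mathcal{Y}}$:

$$d \cdot H_\Phi(\rho_Y) = \sum_{y \in \mathcal{Y}} \mu'(y) \operatorname{Tr}[\rho'_y \log \rho'_y] - \operatorname{Tr}[\bar\rho' \log \bar\rho'] \tag{5.10}$$

$$=: \chi(\mu', \boldsymbol{\rho}'), \tag{5.11}$$

where $\bar\rho' \triangleq \mathbb{E}_{\mu'}[\rho_Y] = \sum_{y \in \mathcal{Y}} \mu'(y) \rho'_y$.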
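A small numerical illustration of the two Holevo quantities may help. The sketch below builds a random ensemble of qubit states, evolves it through a classical channel according to (5.8) and (5.9), and compares $\chi(\mu, \boldsymbol{\rho})$ with $\chi(\mu', \boldsymbol{\rho}')$. The channel, the input distribution, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def random_density(d, rng):
    """Random full-rank density matrix (PSD with unit trace)."""
    a = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = a @ a.conj().T
    return rho / np.trace(rho).real

def tr_xlogx(m):
    """Tr[m log m] for PSD m, with 0 log 0 = 0."""
    w = np.clip(np.linalg.eigvalsh(m), 0.0, None)
    return float(np.sum(np.where(w > 1e-15, w * np.log(np.maximum(w, 1e-300)), 0.0)))

def logm_pd(m):
    """Matrix logarithm of a positive definite Hermitian matrix."""
    w, v = np.linalg.eigh(m)
    return (v * np.log(w)) @ v.conj().T

def holevo(mu, rhos):
    """chi = sum_x mu(x) Tr[rho_x log rho_x] - Tr[rho_bar log rho_bar]."""
    rho_bar = np.einsum("x,xij->ij", mu, rhos)
    return sum(m * tr_xlogx(r) for m, r in zip(mu, rhos)) - tr_xlogx(rho_bar)

def evolve(mu, K, rhos):
    """(5.8)-(5.9): mu' = mu K and rho'_y = sum_x K*(x|y) rho_x, with K[x, y] = K(y|x)."""
    mu_out = mu @ K
    Kstar = (K * mu[:, None]).T / mu_out[:, None]
    return mu_out, np.einsum("yx,xij->yij", Kstar, rhos)

rng = np.random.default_rng(2)
mu = np.array([0.2, 0.5, 0.3])
K = np.array([[0.7, 0.3], [0.4, 0.6], [0.1, 0.9]])
rhos = np.stack([random_density(2, rng) for _ in mu])
mu_out, rhos_out = evolve(mu, K, rhos)
chi_in, chi_out = holevo(mu, rhos), holevo(mu_out, rhos_out)

# Relative-entropy form of the Holevo quantity: sum_x mu(x) S(rho_x || rho_bar).
rho_bar = np.einsum("x,xij->ij", mu, rhos)
chi_rel = sum(m * np.trace(r @ (logm_pd(r) - logm_pd(rho_bar))).real for m, r in zip(mu, rhos))
print(chi_in, chi_out, chi_rel)
```

The evolution preserves the average state and does not increase the Holevo quantity, which is the content of Lemma A.7 for $\Phi(u) = u \log u$.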

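Now for any $\mu \in \mathcal{P}(\mathcal{X})$ and $K \in \mathcal{M}(\mathcal{Y}|\mathcal{X})$, we define the constant:

$$\eta_\Phi(\mu, K) \triangleq \sup \frac{\chi(\mu', \boldsymbol{\rho}')}{\chi(\mu, \boldsymbol{\rho})}. \tag{5.12}$$

By Jensen's inequality, it can be shown that $0 \le \eta_\Phi(\mu, K) \le 1$ (see Lemma A.7). Therefore, we relate $\eta_\Phi(\mu, K)$ to the following functional inequality of the matrix $\Phi$-entropies:

Proposition 5.1 (Functional Form for C-Q Ensembles). Fix an admissible pair $(\mu, K)$ and let $(X, Y)$ be a random pair with probability law $\mu \otimes K$. Then $\eta_\Phi(\mu, K) \le c$ if and only if the inequality

$$H_\Phi[f(X)] \le \frac{1}{1 - c}\, \mathbb{E}\big( H_\Phi[f(X)|Y] \big) \tag{5.13}$$

holds for all non-constant classical-quantum states $f : \mathcal{X} \to \mathcal{Q}(\mathbb{C}^d)$, where we denote by $\mathcal{Q}(\mathbb{C}^d)$ the set of all density operators on $\mathbb{C}^d$. In particular, Eq. (5.13) can be expressed in terms of Holevo quantities:

$$\chi(\mu, \boldsymbol{\rho}) \le \frac{1}{1 - c}\, \mathbb{E}_Y\big[ \chi(K^*, \boldsymbol{\rho} \mid Y) \big], \tag{5.14}$$

where the expectation $\mathbb{E}_Y$ is taken with respect to $\{\mu K(y)\}_{y \in \mathcal{Y}}$. Moreover,

$$\eta_\Phi(\mu, K) = 1 - \inf\left\{ \frac{\mathbb{E}[H_\Phi(f(X)|Y)]}{H_\Phi(f(X))} : f \ne \text{const} \right\}. \tag{5.15}$$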

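Proof. The inequality $\chi(\mu', \boldsymbol{\rho}') \le c \cdot \chi(\mu, \boldsymbol{\rho})$ is equivalent to

$$\begin{aligned}
H_\Phi(K^* f(Y)) &\le c\, H_\Phi(f(X)) \\
&= c \big( \mathbb{E}[H_\Phi(f(X)|Y)] + H_\Phi(\mathbb{E}[f(X)|Y]) \big) \\
&= c \big( \mathbb{E}[H_\Phi(f(X)|Y)] + H_\Phi(K^* f(Y)) \big),
\end{aligned} \tag{5.16}$$

where we use the law of total variance, Eq. (5.7), and the property of the backward channel, Eq. (5.5). Rearranging, we obtain

$$H_\Phi(K^* f(Y)) \le \frac{c}{1 - c}\, \mathbb{E}[H_\Phi(f(X)|Y)], \tag{5.17}$$

and hence,

$$H_\Phi[f(X)] = \mathbb{E}[H_\Phi(f(X)|Y)] + H_\Phi(K^* f(Y)) \le \frac{1}{1 - c}\, \mathbb{E}\big( H_\Phi[f(X)|Y] \big). \tag{5.18}$$

□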










Raginsky showed, in recent work [26, 27], that if $f = \mathrm{d}\nu / \mathrm{d}\mu$ (i.e. a Radon-Nikodym derivative), then

$$K^* f = \frac{\mathrm{d}(\nu K)}{\mathrm{d}(\mu K)}. \tag{5.19}$$

Moreover, the constant $\eta_\Phi(\mu, K)$ in Eq. (5.12) corresponds to the (classical) strong data processing inequality (SDPI):

$$\eta_\Phi(\mu, K) = \sup \frac{D(\nu K \| \mu K)}{D(\nu \| \mu)}, \tag{5.20}$$

where $D(\nu \| \mu)$ is the classical Kullback-Leibler divergence of $\mu$ and $\nu$. We remark that Proposition 5.1 recovers Raginsky's SDPI result by letting $d = 1$.
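For $d = 1$ the objects above are ordinary probability vectors, so (5.19) and the divergence ratio in (5.20) can be evaluated directly. The sketch below uses a binary symmetric channel with crossover probability `delta` and a uniform reference distribution; these choices, like the variable names, are assumptions for illustration and it evaluates the ratio for one $\nu$ only, not the supremum.

```python
import numpy as np

def kl(nu, mu):
    """Kullback-Leibler divergence D(nu || mu) for strictly positive mu."""
    mask = nu > 0
    return float(np.sum(nu[mask] * np.log(nu[mask] / mu[mask])))

delta = 0.1
K = np.array([[1 - delta, delta],
              [delta, 1 - delta]])       # binary symmetric channel, K[x, y] = K(y|x)
mu = np.array([0.5, 0.5])
nu = np.array([0.9, 0.1])

f = nu / mu                                           # f = d(nu)/d(mu)
Kstar = (K * mu[:, None]).T / (mu @ K)[:, None]       # backward channel, Kstar[y, x]
lhs = Kstar @ f                                       # (K* f)(y)
rhs = (nu @ K) / (mu @ K)                             # d(nu K)/d(mu K)
ratio = kl(nu @ K, mu @ K) / kl(nu, mu)               # one term inside the supremum of (5.20)
print(lhs, rhs, ratio)
```

The first two printed vectors coincide, as (5.19) predicts, and the ratio lies in $[0, 1]$.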
Appendix A. Miscellaneous Lemmas

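Lemma A.1 (Second-Order Fréchet Derivative of Inversion Function). Let $\mathcal{G} : \mathbb{M} \to \mathbb{M}$ be second-order Fréchet differentiable at $A \in \mathbb{M}$, and let $\mathcal{G}(A)$ be invertible. Then, for each $h, k \in \mathbb{M}$, we have

$$\mathsf{D}\,\mathcal{G}[A]^{-1}(h) = -\mathcal{G}(A)^{-1} \cdot \mathsf{D}\mathcal{G}[A](h) \cdot \mathcal{G}(A)^{-1};$$

$$\begin{aligned}
\mathsf{D}^2 \mathcal{G}[A]^{-1}(h, k) ={}& \mathcal{G}(A)^{-1} \cdot \mathsf{D}\mathcal{G}[A](h) \cdot \mathcal{G}(A)^{-1} \cdot \mathsf{D}\mathcal{G}[A](k) \cdot \mathcal{G}(A)^{-1} \\
&+ \mathcal{G}(A)^{-1} \cdot \mathsf{D}\mathcal{G}[A](k) \cdot \mathcal{G}(A)^{-1} \cdot \mathsf{D}\mathcal{G}[A](h) \cdot \mathcal{G}(A)^{-1} \\
&- \mathcal{G}(A)^{-1} \cdot \mathsf{D}^2 \mathcal{G}[A](h, k) \cdot \mathcal{G}(A)^{-1}.
\end{aligned}$$

In particular, for $h = k$ the first two terms combine into $2 \cdot \mathcal{G}(A)^{-1} \cdot \mathsf{D}\mathcal{G}[A](h) \cdot \mathcal{G}(A)^{-1} \cdot \mathsf{D}\mathcal{G}[A](h) \cdot \mathcal{G}(A)^{-1}$.

Proof. Denote by $\mathcal{F} : A \mapsto A^{-1}$ the inversion function and recall the chain rule of the Fréchet derivative:

$$\mathsf{D}\,\mathcal{F} \circ \mathcal{G}[A](h) = \mathsf{D}\mathcal{F}[\mathcal{G}(A)]\big( \mathsf{D}\mathcal{G}[A](h) \big);$$

$$\mathsf{D}^2 \mathcal{F} \circ \mathcal{G}[A](h, k) = \mathsf{D}^2 \mathcal{F}[\mathcal{G}(A)]\big( \mathsf{D}\mathcal{G}[A](h), \mathsf{D}\mathcal{G}[A](k) \big) + \mathsf{D}\mathcal{F}[\mathcal{G}(A)]\big( \mathsf{D}^2 \mathcal{G}[A](h, k) \big).$$

It follows that

$$\begin{aligned}
\mathsf{D}\,\mathcal{G}[A]^{-1}(h) &= \mathsf{D}\,\mathcal{F} \circ \mathcal{G}[A](h) \\
&= \mathsf{D}\mathcal{F}[\mathcal{G}(A)]\big( \mathsf{D}\mathcal{G}[A](h) \big) \\
&= -\mathcal{G}(A)^{-1} \cdot \mathsf{D}\mathcal{G}[A](h) \cdot \mathcal{G}(A)^{-1},
\end{aligned}$$

where we use the formula of the Fréchet derivative of the inversion function in the last equation; see Eq. (2.2):

$$\mathsf{D}[X]^{-1}(Y) = -X^{-1} Y X^{-1}$$

(by taking $X = \mathcal{G}(A)$ and $Y = \mathsf{D}\mathcal{G}[A](h)$).

Furthermore,

$$\begin{aligned}
\mathsf{D}^2 \mathcal{G}[A]^{-1}(h, k) ={}& \mathsf{D}^2 \mathcal{F} \circ \mathcal{G}[A](h, k) \\
={}& \mathsf{D}^2 \mathcal{F}[\mathcal{G}(A)]\big( \mathsf{D}\mathcal{G}[A](h), \mathsf{D}\mathcal{G}[A](k) \big) + \mathsf{D}\mathcal{F}[\mathcal{G}(A)]\big( \mathsf{D}^2 \mathcal{G}[A](h, k) \big) \\
={}& \mathcal{G}(A)^{-1} \cdot \mathsf{D}\mathcal{G}[A](h) \cdot \mathcal{G}(A)^{-1} \cdot \mathsf{D}\mathcal{G}[A](k) \cdot \mathcal{G}(A)^{-1} \\
&+ \mathcal{G}(A)^{-1} \cdot \mathsf{D}\mathcal{G}[A](k) \cdot \mathcal{G}(A)^{-1} \cdot \mathsf{D}\mathcal{G}[A](h) \cdot \mathcal{G}(A)^{-1} \\
&- \mathcal{G}(A)^{-1} \cdot \mathsf{D}^2 \mathcal{G}[A](h, k) \cdot \mathcal{G}(A)^{-1}.
\end{aligned}$$

Again, we use the formula of the second-order Fréchet derivative (see e.g. [46, Exercise 3.27]):

$$\mathsf{D}^2 [X]^{-1}(Y_1, Y_2) = X^{-1} Y_1 X^{-1} Y_2 X^{-1} + X^{-1} Y_2 X^{-1} Y_1 X^{-1}$$

(by taking $X = \mathcal{G}(A)$, $Y_1 = \mathsf{D}\mathcal{G}[A](h)$, and $Y_2 = \mathsf{D}\mathcal{G}[A](k)$). □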
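The two derivative formulas for the inversion map used in this proof can be compared against finite differences. The sketch below does so for the special case where the inner map is the identity, so that only $X \mapsto X^{-1}$ is differentiated; the step size, the random test matrices, and the function names are assumptions for illustration.

```python
import numpy as np

def d_inv(X, Y):
    """First Frechet derivative of X -> X^{-1} in direction Y: -X^{-1} Y X^{-1}."""
    Xi = np.linalg.inv(X)
    return -Xi @ Y @ Xi

def d2_inv(X, Y1, Y2):
    """Second Frechet derivative of X -> X^{-1} in directions (Y1, Y2)."""
    Xi = np.linalg.inv(X)
    return Xi @ Y1 @ Xi @ Y2 @ Xi + Xi @ Y2 @ Xi @ Y1 @ Xi

rng = np.random.default_rng(3)
X = 4.0 * np.eye(3) + 0.3 * rng.normal(size=(3, 3))   # well-conditioned base point
Y1, Y2 = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
t = 1e-4
inv = np.linalg.inv

# Central difference for the first derivative.
fd1 = (inv(X + t * Y1) - inv(X - t * Y1)) / (2 * t)
# Symmetric mixed difference for the second derivative.
fd2 = (inv(X + t * Y1 + t * Y2) - inv(X + t * Y1 - t * Y2)
       - inv(X - t * Y1 + t * Y2) + inv(X - t * Y1 - t * Y2)) / (4 * t ** 2)

print(np.max(np.abs(fd1 - d_inv(X, Y1))), np.max(np.abs(fd2 - d2_inv(X, Y1, Y2))))
```

Both printed discrepancies are at the level of the finite-difference error.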
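Lemma A.2. Let $X$ be a random matrix taking values in $\mathbb{M}^{sa}$, and let $Y$ be independently and identically distributed as $X$. Then for each natural number $q \ge 1$,

$$\mathbb{E}\big[ |X - \mathbb{E}X|^q \big] = \mathbb{E}\big[ (X - \mathbb{E}X)_+^q \big] + \mathbb{E}\big[ (X - \mathbb{E}X)_-^q \big] \tag{A.1}$$

and

$$\frac{1}{2} \mathbb{E}\big[ |X - Y|^q \big] = \mathbb{E}\big[ (X - Y)_+^q \big] = \mathbb{E}\big[ (X - Y)_-^q \big]. \tag{A.2}$$

In particular,

$$\mathbb{E}\big[ (X - \mathbb{E}X)^2 \big] = \frac{1}{2} \mathbb{E}\big[ (X - Y)^2 \big]. \tag{A.3}$$

Proof. For each realization $A$ of $X$ in $\mathbb{M}^{sa}$, $A = A_+ - A_-$ for some $A_+, A_- \succeq 0$ with $A_+ A_- = 0$. Slightly abusing the notation, we hence write $X_+$ and $X_-$ to denote the positive and negative parts of the realizations of $X$.

Therefore, for each natural number $q \ge 1$,

$$\begin{aligned}
\mathbb{E}\big[ |X - \mathbb{E}X|^q \big] &= \mathbb{E}\big[ \big( (X - \mathbb{E}X)_+ + (X - \mathbb{E}X)_- \big)^q \big] \\
&= \mathbb{E}\big[ (X - \mathbb{E}X)_+^q + (X - \mathbb{E}X)_-^q \big] \\
&= \mathbb{E}\big[ (X - \mathbb{E}X)_+^q \big] + \mathbb{E}\big[ (X - \mathbb{E}X)_-^q \big].
\end{aligned}$$

Likewise, we have

$$\begin{aligned}
\frac{1}{2} \mathbb{E}\big[ |X - Y|^q \big] &= \frac{1}{2} \mathbb{E}\big[ \big( (X - Y)_+ + (Y - X)_+ \big)^q \big] \\
&= \frac{1}{2} \mathbb{E}\big[ (X - Y)_+^q + (Y - X)_+^q \big] \\
&= \frac{1}{2} \mathbb{E}\big[ (X - Y)_+^q \big] + \frac{1}{2} \mathbb{E}\big[ (Y - X)_+^q \big] \\
&= \mathbb{E}\big[ (X - Y)_+^q \big].
\end{aligned}$$

The last line follows since $Y$ is an identical copy of $X$.

Following the same reasoning, we have $|X| = X_- + (-X)_-$, and thus $\frac{1}{2} \mathbb{E}[|X - Y|^q] = \mathbb{E}[(X - Y)_-^q]$. Finally, Eq. (A.3) follows from elementary calculations:

$$\begin{aligned}
\frac{1}{2} \mathbb{E}\big[ (X - Y)^2 \big] &= \frac{1}{2} \mathbb{E}\big[ X^2 - XY - YX + Y^2 \big] \\
&= \frac{1}{2} \big[ \mathbb{E}X^2 - \mathbb{E}X \cdot \mathbb{E}Y - \mathbb{E}Y \cdot \mathbb{E}X + \mathbb{E}Y^2 \big] \\
&= \mathbb{E}X^2 - (\mathbb{E}X)^2 \\
&= \mathbb{E}\big[ (X - \mathbb{E}X)^2 \big].
\end{aligned}$$

□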
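The identities (A.2) and (A.3) can be verified exactly for a finitely supported random matrix by enumerating all pairs of realizations. The sketch below does this for a randomly generated four-point distribution on $3 \times 3$ symmetric matrices; the distribution, the choice $q = 3$, and the helper names are assumptions for illustration.

```python
import numpy as np

def pos_part_power(m, q):
    """(m)_+^q: keep the positive eigenvalues of a Hermitian m and raise them to q."""
    w, v = np.linalg.eigh(m)
    return (v * np.clip(w, 0.0, None) ** q) @ v.conj().T

def abs_power(m, q):
    """|m|^q for a Hermitian matrix m."""
    w, v = np.linalg.eigh(m)
    return (v * np.abs(w) ** q) @ v.conj().T

rng = np.random.default_rng(4)
probs = np.array([0.1, 0.2, 0.3, 0.4])
vals = rng.normal(size=(4, 3, 3))
vals = (vals + vals.transpose(0, 2, 1)) / 2        # symmetric realizations of X

# (A.3): E[(X - EX)^2] versus (1/2) E[(X - Y)^2] with Y an independent copy of X.
mean = np.einsum("k,kij->ij", probs, vals)
var_lhs = sum(p * (a - mean) @ (a - mean) for p, a in zip(probs, vals))
var_rhs = 0.5 * sum(p * r * (a - b) @ (a - b)
                    for p, a in zip(probs, vals) for r, b in zip(probs, vals))

# (A.2) with q = 3: (1/2) E|X - Y|^q versus E[(X - Y)_+^q].
q = 3
half_abs = 0.5 * sum(p * r * abs_power(a - b, q)
                     for p, a in zip(probs, vals) for r, b in zip(probs, vals))
pos = sum(p * r * pos_part_power(a - b, q)
          for p, a in zip(probs, vals) for r, b in zip(probs, vals))
print(np.max(np.abs(var_lhs - var_rhs)), np.max(np.abs(half_abs - pos)))
```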


Lemma A.3 (Central Limit Theorem of Gaussian Unitary Ensembles). Let $\{\epsilon_j\}_j$ be a series of Rademacher variables, and let $\{W_j\}_j$, $\{W'_j\}_j$ be $d \times d$ matrices whose entries are sampled independently from the Rademacher variables. Let

$$Y_j \triangleq \frac{(W_j + \mathrm{i} \cdot W'_j) + (W_j + \mathrm{i} \cdot W'_j)^\dagger}{2}$$

and

$$S_m \triangleq \frac{1}{\sqrt{m}} \sum_{j=1}^{m} \epsilon_j Y_j,$$

where $\{\epsilon_j\}_j$ are Rademacher variables again. If $m$ tends to infinity, then $S_m$ converges in distribution to a $d \times d$ matrix in the Gaussian unitary ensemble.

Proof. It is clear from the central limit theorem that the diagonal entries converge to a standard real Gaussian variable, while the upper-triangular entries converge to complex Gaussian variables with zero mean and unit variance. Next, we show that the correlation between any two (non-identical) entries vanishes as $m$ goes to infinity. That is, for every $(k, l) \ne (k', l')$,

$$\mathbb{E}_\epsilon\big[ S_m^{(kl)} S_m^{(k'l')} \big] = \frac{1}{m} \sum_{j=1}^{m} Y_j^{(kl)} Y_j^{(k'l')},$$

from which we apply the strong law of large numbers to obtain

$$\lim_{m \to \infty} \frac{1}{m} \sum_{j=1}^{m} Y_j^{(kl)} Y_j^{(k'l')} = \mathbb{E}[Y \cdot Y'] = \mathbb{E}[Y] \cdot \mathbb{E}[Y'] = 0 \quad \text{almost surely},$$

where we denote by $Y$ (resp. $Y'$) the random variable that the sequence $\{Y_j^{(kl)}\}_j$ (resp. $\{Y_j^{(k'l')}\}_j$) is sampled from. It is easy to see that $Y$ and $Y'$ are independent zero-mean random variables. Therefore, the entries are mutually independent and $\lim_{m \to \infty} S_m$ belongs to the Gaussian unitary ensemble. □

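Lemma A.4 (Parseval's Identity for Matrix-Valued Functions). For every matrix-valued function $f : \{0,1\}^n \to \mathbb{M}_d$, we have the following identity

$$\mathbb{E}\big[ f(X)^2 \big] = \sum_{S \subseteq [n]} \hat{f}(S)^2,$$

where the expectation is taken uniformly over all $x \in \{0,1\}^n$.

Proof. With the Fourier expansion of the matrix-valued function $f$, it follows that

$$\begin{aligned}
\mathbb{E}\big[ f(X)^2 \big] &= \mathbb{E}\Big[ f(X) \cdot \sum_{S \subseteq [n]} \hat{f}(S) \chi_S(X) \Big] \\
&= \sum_{S \subseteq [n]} \mathbb{E}\big[ f(X) \chi_S(X) \big] \cdot \hat{f}(S) \\
&= \sum_{S \subseteq [n]} \hat{f}(S)^2.
\end{aligned}$$

□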
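The matrix Parseval identity is easy to confirm by brute force on a small hypercube. The sketch below computes the matrix Fourier coefficients $\hat f(S) = \mathbb{E}[f(X)\chi_S(X)]$ of a random symmetric-valued function and compares the two sides as full matrices; the dictionary representation of $f$ and the function names are assumptions for illustration.

```python
import itertools
import numpy as np

def chi(S, x):
    """Character chi_S(x) = (-1)^{sum_{i in S} x_i}; S is a tuple of 0-based indices."""
    return (-1) ** sum(x[i] for i in S)

def fourier_coefficients(f, n):
    """hat f(S) = E[f(X) chi_S(X)] under the uniform law on {0,1}^n."""
    pts = list(itertools.product((0, 1), repeat=n))
    subsets = [S for r in range(n + 1) for S in itertools.combinations(range(n), r)]
    return {S: sum(f[x] * chi(S, x) for x in pts) / len(pts) for S in subsets}

n, d = 3, 2
rng = np.random.default_rng(5)
f = {}
for x in itertools.product((0, 1), repeat=n):
    a = rng.normal(size=(d, d))
    f[x] = (a + a.T) / 2                      # symmetric test values

fhat = fourier_coefficients(f, n)
lhs = sum(f[x] @ f[x] for x in f) / 2 ** n    # E[f(X)^2]
rhs = sum(c @ c for c in fhat.values())       # sum_S hat f(S)^2
print(np.max(np.abs(lhs - rhs)))
```

The same coefficients also reconstruct $f$ pointwise through $f(x) = \sum_S \hat f(S)\chi_S(x)$.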


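Lemma A.5. With the prevailing assumptions, and every $f : \{0,1\}^n \to \mathbb{M}_d^{sa}$, we have

$$\sum_{S \subseteq [n]} \operatorname{tr}\big[ |S| \hat{f}(S)^2 \big] = \mathcal{E}(f).$$

Proof. For every $n$-tuple $x = (x_1, \ldots, x_n) \in \{0,1\}^n$, denote $x^{(i)} = (x_1, \ldots, x_{i-1}, 1 - x_i, x_{i+1}, \ldots, x_n)$. For every $i \in [n]$, introduce the matrix-valued function

$$g_i(x) \triangleq \frac{f(x) - f(x^{(i)})}{2}.$$

Then, for every $S \subseteq [n]$, it can be observed that

$$\hat{g}_i(S) = \mathbb{E}\big[ g_i(X) \chi_S(X) \big] = \frac{1}{2} \mathbb{E}\Big[ \big( f(X) - f(X^{(i)}) \big) \cdot (-1)^{\sum_{j \in S} X_j} \Big] = \begin{cases} 0 & \text{if } i \notin S, \\ \hat{f}(S) & \text{if } i \in S. \end{cases}$$

Apply Parseval's identity, Lemma A.4, to obtain

$$\mathbb{E}\big[ g_i(X)^2 \big] = \sum_{S \subseteq [n]} \hat{g}_i(S)^2 = \sum_{S : i \in S} \hat{f}(S)^2.$$

Finally, since $X$ is uniformly distributed, $\mathcal{E}(f)$ can be rewritten as

$$\begin{aligned}
\mathcal{E}(f) &= \frac{1}{2} \operatorname{tr} \mathbb{E}\Big[ \sum_{i=1}^{n} \big( f(X) - f(\tilde{X}^{(i)}) \big)^2 \Big] \\
&= \frac{1}{4} \operatorname{tr} \mathbb{E}\Big[ \sum_{i=1}^{n} \big( f(X) - f(X^{(i)}) \big)^2 \Big] \\
&= \sum_{i=1}^{n} \operatorname{tr} \mathbb{E}\big[ g_i(X)^2 \big] \\
&= \sum_{i=1}^{n} \sum_{S : i \in S} \operatorname{tr} \hat{f}(S)^2 \\
&= \sum_{S \subseteq [n]} \operatorname{tr}\big[ |S| \hat{f}(S)^2 \big].
\end{aligned}$$

This completes the proof. □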
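Lemma A.5 can be checked in the same brute-force manner as Parseval's identity. The sketch below assumes that tr is the normalized trace $\mathrm{Tr}/d$ and that $\mathcal{E}(f) = \frac14 \operatorname{tr}\mathbb{E}\sum_i (f(X) - f(X^{(i)}))^2$ with $X^{(i)}$ the $i$-th bit flipped, as in the proof above; the function names are illustrative.

```python
import itertools
import numpy as np

def fourier_coefficients(f, n):
    """hat f(S) = E[f(X) chi_S(X)] with chi_S(x) = (-1)^{sum_{i in S} x_i}."""
    pts = list(itertools.product((0, 1), repeat=n))
    subsets = [S for r in range(n + 1) for S in itertools.combinations(range(n), r)]
    return {S: sum(f[x] * (-1) ** sum(x[i] for i in S) for x in pts) / len(pts)
            for S in subsets}

def dirichlet_form(f, n):
    """E(f) = (1/4) tr E sum_i (f(X) - f(X^(i)))^2, with tr = Tr/d (assumed)."""
    pts = list(f)
    d = f[pts[0]].shape[0]
    total = 0.0
    for x in pts:
        for i in range(n):
            flipped = x[:i] + (1 - x[i],) + x[i + 1:]
            diff = f[x] - f[flipped]
            total += np.trace(diff @ diff).real / d
    return total / (4 * len(pts))

n, d = 3, 2
rng = np.random.default_rng(7)
f = {}
for x in itertools.product((0, 1), repeat=n):
    a = rng.normal(size=(d, d))
    f[x] = (a + a.T) / 2

fhat = fourier_coefficients(f, n)
spectral = sum(len(S) * np.trace(c @ c).real / d for S, c in fhat.items())
energy = dirichlet_form(f, n)
print(spectral, energy)
```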

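Lemma A.6. Let $Z$ be a random matrix taking values in $\mathbb{M}^+$ such that $\|Z\|_\infty < \infty$. For $p \in [1, 2)$, we define the matrix-valued $p$-variance of $Z$ by

$$\operatorname{Var}_p[Z] \triangleq \mathbb{E}[Z^2] - \big( \mathbb{E}[Z^p] \big)^{2/p}.$$

It follows that

$$\lim_{p \to 2^-} \frac{\operatorname{Var}_p[Z]}{2 - p} = \frac{1}{2} \mathbb{E}\big[ Z^2 \log Z^2 \big] - \frac{1}{2} \mathbb{E}[Z^2] \cdot \log \mathbb{E}[Z^2].$$

Proof. We first prove a formula for the matrix differentiation. Denote by $A \equiv A(p)$ a Hermitian matrix which depends on the real parameter $p$. We aim to compute the derivative of $A^{2/p}$ with respect to $p$. Let $Y = A^{2/p}$. Then $\log Y = \log A \cdot 2/p$. Differentiating both sides with respect to $p$ and applying the chain rule of the Fréchet derivatives (see Proposition 2.1), the above expression leads to

$$\begin{aligned}
\frac{\mathrm{d}}{\mathrm{d}p} \log Y &= \int_0^\infty (sI + Y)^{-1} \cdot \frac{\mathrm{d}}{\mathrm{d}p} Y \cdot (sI + Y)^{-1} \, \mathrm{d}s \\
&= \frac{\mathrm{d}}{\mathrm{d}p} \Big( \log A \cdot \frac{2}{p} \Big) \\
&= -\frac{2}{p^2} \log A + \frac{2}{p} \int_0^\infty (tI + A)^{-1} \cdot \frac{\mathrm{d}}{\mathrm{d}p} A \cdot (tI + A)^{-1} \, \mathrm{d}t.
\end{aligned} \tag{A.4}$$

Note that $\mathsf{T}_D(K) \triangleq \int_0^\infty (sI + D)^{-1} K (sI + D)^{-1} \, \mathrm{d}s$ is called the Bogoliubov-Kubo-Mori operator and its inverse is well-known to be (see e.g. [70, Appendix C.2]):

$$\mathsf{T}_D^{-1}(L) = \int_0^1 D^s L D^{1-s} \, \mathrm{d}s,$$

from which Eq. (A.4) yields

$$\begin{aligned}
\frac{\mathrm{d}}{\mathrm{d}p} Y &= \int_0^1 Y^s \left[ -\frac{2}{p^2} \log A + \frac{2}{p} \int_0^\infty (tI + A)^{-1} \cdot \frac{\mathrm{d}}{\mathrm{d}p} A \cdot (tI + A)^{-1} \, \mathrm{d}t \right] Y^{1-s} \, \mathrm{d}s \\
&= \int_0^1 A^{2s/p} \left[ -\frac{2}{p^2} \log A + \frac{2}{p} \int_0^\infty (tI + A)^{-1} \cdot \frac{\mathrm{d}}{\mathrm{d}p} A \cdot (tI + A)^{-1} \, \mathrm{d}t \right] A^{2/p - 2s/p} \, \mathrm{d}s \\
&= -\frac{2}{p^2} A^{2/p} \cdot \log A + \frac{2}{p} \int_0^1 \!\! \int_0^\infty A^{2s/p} (tI + A)^{-1} \cdot \frac{\mathrm{d}}{\mathrm{d}p} A \cdot (tI + A)^{-1} A^{2/p - 2s/p} \, \mathrm{d}t \, \mathrm{d}s.
\end{aligned}$$

Now taking $A = \mathbb{E}[Z^p]$, we have

$$\frac{\mathrm{d}}{\mathrm{d}p} A = \frac{\mathrm{d}}{\mathrm{d}p} \mathbb{E}[Z^p] = \mathbb{E}[Z^p \cdot \log Z],$$

and

$$\frac{\mathrm{d}}{\mathrm{d}p} \big( \mathbb{E}[Z^p] \big)^{2/p} = -\frac{2}{p^2} \big( \mathbb{E}[Z^p] \big)^{2/p} \log \mathbb{E}[Z^p] + \frac{2}{p} \int_0^1 \!\! \int_0^\infty A^{2s/p} (tI + A)^{-1} \cdot \mathbb{E}[Z^p \cdot \log Z] \cdot (tI + A)^{-1} A^{2/p - 2s/p} \, \mathrm{d}t \, \mathrm{d}s. \tag{A.5}$$

Finally, we are ready to prove our claim. L'Hôpital's rule implies, with $A = \mathbb{E}[Z^2]$ at $p = 2$,

$$\begin{aligned}
\lim_{p \to 2^-} \frac{\operatorname{Var}_p[Z]}{2 - p} &= \frac{\mathrm{d}}{\mathrm{d}p} \big( \mathbb{E}[Z^p] \big)^{2/p} \Big|_{p=2} \\
&= -\frac{1}{2} \mathbb{E}[Z^2] \cdot \log \mathbb{E}[Z^2] + \int_0^1 \!\! \int_0^\infty A^{s} (tI + A)^{-1} \cdot \mathbb{E}[Z^2 \cdot \log Z] \cdot (tI + A)^{-1} A^{1-s} \, \mathrm{d}t \, \mathrm{d}s \\
&= -\frac{1}{2} \mathbb{E}[Z^2] \cdot \log \mathbb{E}[Z^2] + \mathbb{E}[Z^2 \log Z] \\
&= \frac{1}{2} \mathbb{E}\big[ Z^2 \log Z^2 \big] - \frac{1}{2} \mathbb{E}[Z^2] \cdot \log \mathbb{E}[Z^2],
\end{aligned}$$

completing the proof. □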
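The limit in Lemma A.6 can be approximated numerically by evaluating $\operatorname{Var}_p[Z]/(2-p)$ slightly below $p = 2$. The sketch below does this for a two-point distribution on strictly positive definite $2 \times 2$ matrices; the distribution, the gap `eps`, and the function names are assumptions for illustration.

```python
import numpy as np

def fun(m, g):
    """Apply a scalar function g to the eigenvalues of a Hermitian matrix m."""
    w, v = np.linalg.eigh(m)
    return (v * g(w)) @ v.conj().T

def var_p(probs, mats, p):
    """Var_p[Z] = E[Z^2] - (E[Z^p])^{2/p} for a finitely supported positive Z."""
    second = sum(q * m @ m for q, m in zip(probs, mats))
    pth = sum(q * fun(m, lambda w: w ** p) for q, m in zip(probs, mats))
    return second - fun(pth, lambda w: w ** (2.0 / p))

def limit_rhs(probs, mats):
    """(1/2) E[Z^2 log Z^2] - (1/2) E[Z^2] log E[Z^2]."""
    xlogx = lambda w: w * np.log(w)
    second = sum(q * m @ m for q, m in zip(probs, mats))
    first = sum(q * fun(m @ m, xlogx) for q, m in zip(probs, mats))
    return 0.5 * first - 0.5 * fun(second, xlogx)

rng = np.random.default_rng(6)
probs = np.array([0.3, 0.7])
mats = []
for _ in probs:
    a = 0.5 * rng.normal(size=(2, 2))
    mats.append(a @ a.T + 0.5 * np.eye(2))      # strictly positive definite values

eps = 1e-6
approx = var_p(probs, mats, 2.0 - eps) / eps    # Var_p[Z] / (2 - p) just below p = 2
exact = limit_rhs(probs, mats)
print(np.max(np.abs(approx - exact)))
```

The discrepancy is of the order of `eps`, consistent with a first-order approximation of the derivative at $p = 2$.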


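Lemma A.7. Fix sample spaces $\mathcal{X}$ and $\mathcal{Y}$. For every distribution $\mu \in \mathcal{P}(\mathcal{X})$, Markov kernel $K \in \mathcal{M}(\mathcal{Y}|\mathcal{X})$ and matrix-valued function $f : \mathcal{X} \to \mathbb{M}_d^+$, we have the following inequality:

$$\begin{aligned}
H_\Phi(K^* f) &= \operatorname{tr}\big[ \mathbb{E}_{\mu K}[\Phi(K^* f)] - \Phi\big( \mathbb{E}_{\mu K}[K^* f] \big) \big] \\
&\le \operatorname{tr}\big[ \mathbb{E}_\mu \Phi(f) - \Phi(\mathbb{E}_\mu f) \big] \\
&= H_\Phi(f),
\end{aligned}$$

where $\mu K$ and $K^*$ are defined in Eqs. (5.1) and (5.4).

Proof. We first observe that Jensen's inequality [71, Section 5] holds for all convex functions $\Phi$:

$$\operatorname{tr}[K^* \Phi(f)] \ge \operatorname{tr}[\Phi(K^* f)]. \tag{A.6}$$

After taking expectation with respect to $\mu K$, direct calculation shows that (note that we freely interchange the order of trace and expectation by Fubini's theorem):

$$\begin{aligned}
\operatorname{tr} \mathbb{E}_{\mu K}[K^* \Phi(f)] &= \sum_{y \in \mathcal{Y}} \mu K(y) \cdot \operatorname{tr}\big[ K^* \Phi \circ f(y) \big] \\
&= \sum_{y \in \mathcal{Y}} \mu K(y) \cdot \operatorname{tr}\Big[ \sum_{x \in \mathcal{X}} \frac{K(y|x) \mu(x)}{\mu K(y)} \Phi(f(x)) \Big] \\
&= \sum_{x \in \mathcal{X}} \mu(x) \operatorname{tr}\big[ \Phi(f(x)) \big] \\
&= \operatorname{tr} \mathbb{E}_\mu[\Phi(f)] \\
&\ge \sum_{y \in \mathcal{Y}} \mu K(y) \cdot \operatorname{tr}\big[ \Phi(K^* f)(y) \big] \\
&= \operatorname{tr} \mathbb{E}_{\mu K}[\Phi(K^* f)].
\end{aligned}$$

Together with the fact that $\mathbb{E}_{\mu K}[K^* f] = \mathbb{E}_\mu f$, this completes our claim. □

Appendix B. Classical Bonami-Beckner inequality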


In this section, we review the so-called Bonami-Beckner inequality in Theorem B.1. Consider the vector space of all functions $f : \{0,1\}^n \to \mathbb{R}$ endowed with the inner product

$$\langle f, g \rangle \triangleq \frac{1}{2^n} \sum_{x \in \{0,1\}^n} f(x) g(x) = \mathbb{E}[f \cdot g],$$

where we use the notation that the expectation is taken uniformly over all $x \in \{0,1\}^n$. Then, every real-valued function defined on the Boolean hypercube can be uniquely expressed as the Fourier-Walsh expansion⁶:

$$f(x) = \sum_{S \subseteq [n]} \hat{f}(S) \chi_S(x),$$

where the summation is over all $2^n$ subsets $S \subseteq [n] \triangleq \{1, \ldots, n\}$ and

$$\chi_S(x) \triangleq (-1)^{\sum_{i \in S} x_i}$$

form an orthonormal basis (also called Fourier basis) since for any $S, S' \subseteq [n]$,

$$\langle \chi_S, \chi_{S'} \rangle = \begin{cases} 0 & \text{if } S \ne S', \\ 1 & \text{if } S = S'. \end{cases}$$

(We define $\chi_S \equiv 1$ for $S = \emptyset$.) Hence, for all $S \subseteq [n]$, $\hat{f}(S) = \langle f, \chi_S \rangle$ is called the Fourier coefficient of $f$.

For any positive number $\gamma$, define $T_\gamma$ to be

$$T_\gamma f \triangleq \sum_{S \subseteq [n]} \gamma^{|S|} \hat{f}(S) \chi_S.$$

The well-known hypercontractive inequality is stated as follows.

Theorem B.1 (Bonami-Beckner Inequality [69, 72]). Let $1 < p < q < \infty$ and let $\beta > 0$. Define $\gamma = \sqrt{\beta / (q - 1)}$ and $\delta = \sqrt{\beta / (p - 1)}$. Then for any function $f : \{0,1\}^n \to \mathbb{R}$,

$$\|T_\gamma f\|_q \le \|T_\delta f\|_p, \tag{B.1}$$

where the norm is defined as

$$\|f\|_p \triangleq \Big( \frac{1}{2^n} \sum_{x \in \{0,1\}^n} |f(x)|^p \Big)^{1/p} = \big( \mathbb{E}|f(x)|^p \big)^{1/p}.$$

By setting $\beta = p - 1$ and $q = 2$, the hypercontractive inequality (B.1) can be rewritten as

$$\Big( \sum_{S \subseteq [n]} (p - 1)^{|S|} \hat{f}(S)^2 \Big)^{1/2} \le \Big( \frac{1}{2^n} \sum_{x \in \{0,1\}^n} |f(x)|^p \Big)^{1/p}. \tag{B.2}$$
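The special case (B.2) is simple to test numerically for real-valued functions on a small hypercube. The sketch below computes the Fourier-Walsh coefficients of a random function and evaluates both sides of (B.2) for several exponents; the random test function, the chosen exponents, and the function names are assumptions for illustration.

```python
import itertools
import numpy as np

def fourier_walsh(f, n):
    """hat f(S) = E[f(X) chi_S(X)] with chi_S(x) = (-1)^{sum_{i in S} x_i}."""
    pts = list(itertools.product((0, 1), repeat=n))
    subsets = [S for r in range(n + 1) for S in itertools.combinations(range(n), r)]
    return {S: sum(f[x] * (-1) ** sum(x[i] for i in S) for x in pts) / len(pts)
            for S in subsets}

def hypercontractive_sides(f, n, p):
    """Return the left-hand and right-hand sides of (B.2) for exponent p."""
    fhat = fourier_walsh(f, n)
    lhs = np.sqrt(sum((p - 1) ** len(S) * c ** 2 for S, c in fhat.items()))
    rhs = (sum(abs(v) ** p for v in f.values()) / 2 ** n) ** (1 / p)
    return lhs, rhs

n = 3
rng = np.random.default_rng(8)
f = {x: float(rng.normal()) for x in itertools.product((0, 1), repeat=n)}

sides = {p: hypercontractive_sides(f, n, p) for p in (1.2, 1.5, 1.9, 2.0)}
for p, (lhs, rhs) in sides.items():
    print(p, lhs, rhs)
```

At $p = 2$ the two sides coincide, which is Parseval's identity for real-valued functions.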